Abstract


Improvisation is a form of live theatre where artists perform real-time, dynamic 
problem solving to collaboratively generate interesting narratives. The main 
contribution of this thesis is the development of artificial improvisation: im- 
provised theatre performed by humans alongside intelligent machines. First, 
I present background underlying the art of improvisation and the scientific 
fields of interactive machine learning and dialogue generation. Then, I present 
Pyggy, the first experiment on live-stage human-machine improvisation, and
A.L.Ex., the Artificial Language Experiment, which addresses several key
technical improvements over Pyggy. I then present Improbotics, which details
audience evaluation of Turing test-inspired live improvised performance using
A.L.Ex. Two novel contributions to machine-assisted narrative generation are
then presented and discussed. The first of these contributions, Shaping the 
Narrative Arc, is a model incorporating an underlying narrative arc to
improve response generation. The second contribution, dAIrector, synthesizes
a plot graph with contextual information to generate contextual plot points 
and serve as director. The thesis concludes by discussing public reflections on 
live artificial improvisation performances from around the world and interest- 
ing future directions to explore. My work presents fundamental advances in 
human-machine interaction through the lens of improvised theatre, which is
the ideal test bed for collaboration between humans and intelligent machines.
Preface


This thesis is an original work by Kory Wallace Mathewson.¹ Several chapters
of this thesis are based on published or submitted papers written in collabo- 
ration with other researchers. For each chapter contained herein, details on 
these publications are included as footnotes. Submission and author contribu- 
tion details are summarized below for completeness. Unless otherwise stated, 
no parts of this thesis have been previously published. Several of the studies 
included in this thesis are part of the research project which received
research ethics approval from the University of Alberta Research Ethics Board. 
Project Name: Developing an Engaging Artificial Intelligence for
Conversational Performance Art, ID Number: Pro00079215, Date: February 23, 2018.
Mayank Rehani and Dr. Patrick Pilarski provided support compiling and sub- 
mitting the ethics application. 

The background and literature review in Chapters 1 and 2 as well as the 
discussion and conclusions in Chapters 7 and 8 are original works composed 
by Kory Wallace Mathewson and are previously unpublished. The only excep- 
tion is that some of the initial conceptualization and progress as described in 
Chapter 3 is published on Kory Wallace Mathewson’s website as: Mathewson 
KW (2016) Building an Artificial Improvisor.²

Chapter 3 is adapted from a paper submitted and accepted at the 13th 
AAAI Conference on Artificial Intelligence and Interactive Digital Entertain- 
ment (AIIDE 2017). The original technical apparatus (Pyggy) described in 
Chapter 3 was designed and built by Kory Wallace Mathewson. The second
iteration of the technical apparatus (A.L.Ex.) was conceived of by Dr. Piotr
Mirowski (HumanMachine), and designed and built for this work through
collaboration between Kory Wallace Mathewson and Dr. Mirowski. The
experimental design, data collection, and concluding analysis are the authors’
shared original work. It is published as: Mathewson KW and Mirowski P.
(2017) Improvised Theatre Alongside Artificial Intelligences. 13th
AAAI Conference on Artificial Intelligence and Interactive Digital
Entertainment (AIIDE’17). Snowbird, Utah, United States. October 2017.
Additionally, a preliminary version of this work was presented as a workshop
extended abstract and poster: Mathewson KW and Mirowski P. (2017) Artificial
Improvisation: Improvisational Theatre with Deep Neural Networks
and Robots. 2017 Multidisciplinary Conference on Reinforcement Learning
and Decision Making. Ann Arbor, MI, United States, 2017.

¹ Throughout the thesis the narrator will prefer the plural voice over the singular.
² https://korymathewson.com/building-an-artificial-improvisor/.

Chapter 4 is adapted from a paper accepted to the 14th AAAI Confer- 
ence on Artificial Intelligence and Interactive Digital Entertainment (AIIDE 
2018) and a workshop paper submitted and accepted at the 2017 Workshop 
on Machine Learning for Creativity and Design at the 2017 Conference and 
Workshop on Neural Information Processing Systems (NIPS). The
technical apparatus (A.L.Ex.) was designed and built in collaboration with 
Dr. Piotr Mirowski (HumanMachine), as described above. The experimental 
design was developed in collaboration with Dr. Mirowski. The data analysis 
and concluding analysis are the authors’ shared original work. It is published
as Mathewson KW and Mirowski P. (2018) Improbotics: Exploring the 
Imitation Game using Machine Intelligence in Improvised Theatre. 
14th AAAI Conference on Artificial Intelligence and Interactive Digital Enter- 
tainment (AIIDE’18). Edmonton, Alberta, Canada. November 2018. Addi- 
tionally, a preliminary version of this line of experimentation was presented as: 
Mathewson KW and Mirowski P. (2017) Improvised Theatre as a Turing 
Test. NIPS 2017 Workshop on Machine Learning for Creativity and Design. 

Chapter 5 is adapted from a paper entitled Shaping the Narrative Arc: 
An Information-Theoretic Approach to Collaborative Story Gener- 
ation through Dialogue. The technical apparatus was designed and built 
by Kory Wallace Mathewson in collaboration with Pablo Castro and Marc G. 
Bellemare. Pablo Castro assisted with the data collection and contributed to
manuscript edits. Colin Cherry and George Foster provided valuable concep- 
tual insights and manuscript edits. Marc G. Bellemare was the supervisory 
author and was involved with concept formation, experimental design, and 
manuscript composition. 

Chapter 6 is adapted from Eger M. and Mathewson KW. (2018) dAIrec- 
tor: Automatic Story Arc Generation through Knowledge Synthe- 
sis. Joint Intelligent Narrative Technologies / Intelligent Cinematography and 
Editing Workshop at AAAI Conference on Artificial Intelligence and Interac- 
tive Digital Entertainment (AIIDE). November 2018. Edmonton, Alberta, 
Canada. This work is a collaborative effort with Markus Eger (NC State Uni- 
versity). Both authors collaborated on the conceptualization, development of 
the experimental apparatus, data collection, and manuscript preparation.

I include references for 11 other research contributions supporting human- 
machine interaction that I collaborated on: 

• Dziri N, Kamalloo E, Mathewson KW and Zaiane O (2019) Evaluating
Coherence in Dialogue Systems using Entailment. Annual Conference
of the North American Chapter of the Association for Computational
Linguistics (NAACL-HLT). Minneapolis, USA.

• Winters T and Mathewson KW (2019) Automatically Generating
Engaging Presentation Slide Decks. 8th International Conference on
Computational Intelligence in Music, Sound, Art and Design
(EvoMUSART’19). Leipzig, Germany.

• Dziri N, Kamalloo E, Mathewson KW and Zaiane O (2018)
Augmenting Neural Response Generation with Context-Aware Topical
Attention. arXiv preprint arXiv:1811.01063.

• Travnik JB, Mathewson KW, Sutton RS and Pilarski PM (2018)
Reactive Reinforcement Learning in Asynchronous Environments. Frontiers
in Robotics and AI, Vol 5, p.79.

• Pilarski PM, Sutton RS, Mathewson KW, Sherstan C, Parker AS and
Edwards AL (2017) Communicative Capital for Prosthetic Agents. arXiv
preprint arXiv:1711.03676.

• Mathewson KW and Pilarski PM (2017) Concurrent Human Control
and Feedback Shaping for Robot Training with Actor-Critic
Reinforcement Learning. Multidisciplinary Conference on Reinforcement Learning
and Decision Making (RLDM). Ann Arbor, USA.

• Mathewson KW and Pilarski PM (2017) Reinforcement Learning
based Embodied Agents Modelling Human Users Through Interaction
and Multi-Sensory Perception. 2017 AAAI Spring Symposium on
Interactive Multi-Sensory Object Perception for Embodied Agents. Palo
Alto, USA.

• Vega R, Sajed T, Mathewson KW, Khare K, Pilarski PM, Greiner R,
Sanchez-Ante G, Antelis JM (2017) Assessment of feature selection and
classification methods for recognizing motor imagery tasks from
electroencephalographic signals. Artificial Intelligence Research, Vol 6:1.
A whole is that which has a beginning, a middle, and an end. A beginning
is that which does not itself follow anything by causal necessity, but after which
something naturally is or comes to be. An end, on the contrary, is that which
itself naturally follows some other thing, either by necessity, or as a rule, but
has nothing following it. A middle is that which follows something as some
other thing follows it. A well constructed plot, therefore, must neither begin
nor end at haphazard, but conform to these principles.


— Aristotle in Poetics (Part VII), c. 335 BC. 


Supposing, for instance, that the fundamental relations of pitched sounds 
in the science of harmony and of musical composition were susceptible of such 
expression and adaptations, the engine might compose elaborate and scientific
pieces of music of any degree of complexity or extent...


— Ada Augusta, Countess of Lovelace, notes on L. F. Menabrea’s Sketch of 
the Analytical Engine Invented by Charles Babbage, Scientific Memoirs, 1843. 


The script places constraints on dialogue and interaction, and it defines 
concise test scenarios. The stage constrains the environment, especially if it 
is equipped with special sensing, communication or computational infrastruc- 
ture. More importantly, the intelligent stage, with its embedded computing 
and sensing systems, is a resource that autonomous robotic performers could 
use to bolster their own ability to perceive and interact with people within the
environment.


— Cynthia Breazeal, MIT Media Lab in Interactive Robot Theatre, 2003. 
Chapter 1


Introduction 


A good improvisor looks great on stage; 


the best improvisors make everyone else look great. 


— Jacob Banigan, 2004 


Our uniquely human experience is defined by knowledge sharing and sto- 
rytelling. The main means by which we share our experiences is through 
spoken dialogue. Spoken dialogue has evolved to become one of the most complex
functions of human communication. We can engage in multiple conversations
over a short period of time, quickly adapting to our conversational
partners. Through these conversations we share our perspective, seek advice, 
share stories, and find common ground. These interactions collectively define 
the human condition. 

Many commonplace technologies place humans in contact with computer- 
ized dialogue systems—computer systems that interpret and generate speech 
(e.g., cell phones, home assistants). These systems are limited in their
capacity to sustain engaging conversations with humans. To help address this
limitation, dialogue systems may be improved through the ability to adapt to
different speakers, contexts, and modes of operation. Rapid adaptation is a
unique characteristic of human dialogue.

A major domain of study for dialogue research is that of curated dialogue 
in cinematic and theatre scripts. This script research is constrained, as the 
data is meticulously authored and edited by its very nature. Rather than
using curated, written dialogue, we propose improvised theatre dialogue as an
exceptional domain for experimentation toward improving dialogue generation
systems. Improvised theatre is characterized by adaptive performers sponta- 
neously responding to each other, and the environment, to collaboratively 
create a narrative. 

This thesis aims to test and evaluate novel dialogue system improvements
through experimentation with improvisational theatre performers and
audiences. The hypothesis is that these improvements will allow systems to
sustain longer, more engaging interactions than current hand-coded performance
systems. Further to this aim, this thesis investigates interactions
between humans, as participants and observers, and these systems. These
interactions are surveyed under conditions when humans believe they are
interacting with, or watching another human interact with, an artificial
intelligence rather than a human, and vice versa.

Several researchers have designed experiments to understand the interactions
between humans and artificial-intelligence-powered dialogue systems. To
date, these systems have not been investigated by expert human collaborative
storytellers. This thesis accomplishes this testing by recruiting
improvisational theatre artists and audiences to engage with the system and
provide quantitative and qualitative feedback.

Understanding the interaction between humans and artificial-intelligence- 
based dialogue systems offers enormous potential to significantly improve how 
humans express themselves and share their experiences. The findings herein 
offer significant insights into interactions when humans and machines perform 
together on the theatrical stage. With experimentation and future research, 
similar configurations could be used in a wide range of human-machine collab- 
orative tasks; this general impact should not be understated. We interact with 
technology, and specifically machine-learning systems, at an increasing rate in 
our everyday lives. By studying how these interactions occur, from both sides
of the collaboration, we can make the entire experience safer, easier, and more
effective.

Artistically, the main contribution of this thesis is the formal development
and dissemination of the art-form of “artificial improvisation”. Artificial
improvisation is improvisational theatre performed by humans alongside
intelligent machines. The research and development in this thesis has
been used in over 75 shows, by 5 professional theatre companies around the
world. This work has ignited a community of technology-augmented
improvisational theatre performers to present and share their work.

This dissertation contains four contributions supporting the first appli- 
cation of interactive machine learning-based dialogue systems for improvised 
theatre. The contributions are separated into chapters. Each chapter presents 
progress in open-domain dialogue research in improvised theatre. 

• Chapter 3 presents the first formalization of the human-machine improvised
theatre experiment as a suitable, challenging test bed for interactive
open-domain dialogue research. It presents the development and
evaluation of two artificial improvisors: 1) Pyggy, a rule-and-retrieval-based
conversational system, and 2) A.L.Ex. (Artificial Language
Experiment), a generative language modelling-based conversational system.

• Chapter 4 presents evaluation of artificial improvisors, contributing the
first quantitative and qualitative analysis of artificial improvisation. It
presents results and analysis of the systems and experiments introduced
in Chapter 3.

• Chapter 5 contributes a novel method extending artificial improvisation.
“Shaping the Narrative Arc” is a method which models the underlying
arc of conversation. Augmenting conversation systems with a notion
of story arc enhances narrative generation. Results from surveys with
expert human improvisors are presented in support of the method’s
capabilities.

• Chapter 6 contributes the “dAIrector”: a novel method for generating
coherent, contextual plot structures by fusing information contained in
two knowledge graphs.

Chapters 3 and 4 present development and evaluation of a complete artificial
improvisor from a systems approach. Chapters 5 and 6 each present a novel
method to augment any artificial improvisor. The methods for “Shaping the
Narrative Arc” and “dAIrector” are system agnostic, and evaluated in isolation,
disconnected from the artificial improvisors in the earlier chapters.

In addition to these science-directed efforts, this dissertation has made ad- 
ditional contributions toward the artistic goal of integrating machine learning- 
based dialogue systems into theatrical productions. Artificial improvisation 
is a new interactive multimedia art-form which signifies an innovation in the
field of improvisation as supported by reflections from critics, public media,
expert performers, and audiences (Section 7.2). Artists describe the system as
inspiring, engaging, and challenging, and report that it provides them a new
medium to enhance and augment their own artistic creations.

The thesis exists at the intersection between three worlds of research— 
interactive machine learning, dialogue systems, and improvisational theatre. 
In Chapter 2, we present background on these fields with historical context
and supporting literature. We introduce topics and methods, and discuss
how the fields intertwine and support each other. It is
our hope that all background materials necessary to understand the experi- 
ments through this thesis are contained in this chapter. The contributions in 
Chapters 3-6 are detailed previously in this section. Chapter 7 discusses the
scientific and artistic innovations of artificial improvisation, explores how
this work has progressed the frontier, and presents several research directions
for future endeavour. Finally, Chapter 8 provides concluding remarks and
hope for the future.


Chapter 2 


Background 


We are blessed with a language, a system of communication, so powerful and 
so precise that we can share what we’ve learned with such precision that it 
can accumulate in the collective memory. And that means it can outlast the 
individuals who learned that information, and it can accumulate from 
generation to generation. And that’s why, as a species, we’re so creative and 
so powerful, and that’s why we have a history. We seem to be the only
species in four billion years to have this gift.


— David Christian 
TED 2011 


This thesis focuses on advancing interactive machine learning for impro- 
vised theatre. Section 2.1 synthesizes research in interactive machine learning 
(IML), Section 2.2 provides relevant background literature for appreciating 
and understanding the conversational dialogue system research, and Section 
2.3 provides background on the artistic field of improvised theatre. This the- 
sis is positioned at the intersection between these three fields. The chapters 
contain new methods and analysis which represent our contributions to this 
intersection by connecting ideas, methods, and research from IML, dialogue 
systems, and improvised theatre. At a high level, these fields connect at the
intersection of continual learning through interaction, trial and error.


2.1 Interactive Machine Learning 


This section presents and synthesizes research and applications that allow a
human, with no assumed application-specific knowledge or technical ability, to
interact with machine learning (ML) models. We outline Interactive Machine
Learning (IML) techniques and approaches, and present uses of IML research in
creative applications and artistic presentations.
This section concludes with promising prospects for future development.

Developments in IML combine the unique talents of humans and machines 
to produce new capabilities. Human-machine teams are prevalent in the lives 
of many humans. We live in an interactive world; examples include: bicycles 
for transportation, medical prosthetic devices, web search engines, voice-based 
assistants, and connected home technology. Many of these systems already 
incorporate ML. For instance, many systems can learn to personalize
performance based on unique individual interaction.

Fundamentally, IML systems predict appropriate responses to interactions. 
That is, a human provides an input and a system provides a response. This
model of the interaction can be generalized by the motivating question:
“When the human does this, what should the machine do?”


We interact with the expectation that the system will be able to translate our 
inputs into meaningful outputs. While we may understand the limitations of 
these systems, we may also overestimate the capabilities. This overestimation 
can lead to fear. Artificial intelligence and robots have been presented as
dangerous platforms and used to mislead the public.¹ Naivety can also transform
into understanding, appreciation, and excitement for the future. This requires 
rigorous science devoted to effective knowledge transfer; science which is both 
fundamental research and clear communication. 

ML is a field of computing science which explores automated model learn- 
ing. Learning is the improvement in performance on a certain task with respect 
to a given objective. Models are functions which learn to map inputs to
outputs through mathematical transformations. ML problems often start with a
set of data (diverse input/output pairs). Training progresses as the model is
incrementally modified until the diverse inputs are consistently transformed 
into the desired, respective outputs. Thus, ML concerns itself with how to 
program machines to learn a mapping, as opposed to explicitly defining the 
mapping function. 
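
As a minimal sketch of this distinction, the following Python fragment learns the mapping y = 2x + 1 from input/output pairs by incrementally modifying two parameters, rather than by hard-coding the function. The data, learning rate, and the name `train_linear` are illustrative assumptions and are not taken from the systems described in this thesis.

```python
def train_linear(pairs, lr=0.05, epochs=2000):
    """Fit y ~ w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in pairs) / n
        # Incrementally modify the model toward the desired outputs.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical training data: input/output pairs drawn from y = 2x + 1.
pairs = [(x, 2 * x + 1) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]
w, b = train_linear(pairs)
```

The mapping itself is never written down in `train_linear`; only the data and the objective determine the parameters the model settles on.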

Generalization is a secondary, albeit no less important, objective in ML. 
The trained model should be general enough to transform unseen inputs to 
their corresponding outputs. These held-out data-points should be separated 
and delineated prior to training. They are referred to as the validation data 
(used for development and tuning) and test data (used for final evaluation). 
An objective performance mismatch between training, validation, and test 
data can indicate a variety of issues in the training process. 
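
One way to delineate held-out data prior to training is sketched below. The 80/10/10 proportions, the fixed seed, and the helper names `split_data` and `mean_squared_error` are assumptions chosen for illustration, not a prescription from the literature cited here.

```python
import random


def split_data(pairs, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve out validation and test sets before training."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


def mean_squared_error(model, pairs):
    """Objective performance of a model (a callable) on a set of pairs."""
    return sum((model(x) - y) ** 2 for x, y in pairs) / len(pairs)


# Hypothetical dataset of input/output pairs.
data = [(x, 2 * x + 1) for x in range(100)]
train, val, test = split_data(data)
```

Comparing `mean_squared_error` on `train`, `val`, and `test` is one simple way to surface the kind of performance mismatch described above.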

All ML is interactive. Thus, IML systems should be designed using frame- 
works for interaction with humans [284]. IML is a term coined by Fails and 
Olsen (2003) to describe their novel system where a human and an ML al- 
gorithm shared control to perform image segmentation [79]. Conventionally, 
IML is defined as a subfield of ML which explores the addition of a human, 
or multiple humans, in the learning loop. IML is sometimes referred to as 
human-in-the-loop learning. The human(s) may observe the learning and pro- 
vide input to improve training toward the performance objective. 

In many ML systems a human spends time preparing data, designing objec- 
tives, and tuning hyperparameters. These systems could rightly be considered 
IML. Most ML methods, while agnostic to the fact that the task is defined 
by a human, are ultimately optimizing toward an objective function which is 
defined by the subjective opinion of goodness of that human. If the human is 
only defining the objective and preparing the data, the interaction is loosely 
coupled. In many IML examples, the interaction can be more tightly coupled.
For instance, in prosthetic robotics, human nerves are connected
to machine wires [113].

The learning problem underlying IML is: how can machines and people
interact to solve tasks more efficiently and effectively than either could on its
own? [6]. It is helpful to consider several task examples. For instance, if a
human provides a line of dialogue, the machine might discriminate between
a set of underlying topics, predict a sentiment score value, or generate an
intelligible response. Often tasks require adaptable collaboration over the
course of incremental, continuous, and sequential decision making.

While there are many implementation details, teaching ML systems to
perform these tasks is straightforward. A high-level algorithm is shown in
Algorithm 1.

Algorithm 1 Simplified interactive machine learning algorithm.

Given a model $M$ which maps input $I_t$ to output $O_t = M(I_t)$
Given a time horizon $T$, and a test of goodness $\{good, bad\} = G(O_t \mid I_t)$
while $t < T$ do
    obtain $I_t \sim \mathcal{I}$
    $O_t = M(I_t)$
    if $G(O_t \mid I_t) == good$ then
        modulate $M$ to increase probability of $O_t = M(I_{t'})$ $\forall t' > t$
    else {$G(O_t \mid I_t) == bad$}
        modulate $M$ to decrease probability of $O_t = M(I_{t'})$ $\forall t' > t$

Algorithm 1 helps us to frame most ML. It also connects to the motivating 
question. To address this question, IML designers and teachers must define 
the goodness function, $G(\cdot)$, and use a method to modify the outputs of the
model, $M(\cdot)$, toward maximizing accumulated goodness. $M(\cdot)$ maps from the
space of inputs, $I_t \in \mathcal{I}$, to a space of outputs, $O_t \in \mathcal{O}$. It attempts to map
from all of the things an interacting human can input to all of the things
the ML system can output. This interaction between human inputs and model 
outputs constrains IML systems [259]. 
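
The loop in Algorithm 1 can be sketched in a few lines of Python. This is an illustrative instance under stated assumptions, not the implementation used in this thesis: the model M is assumed to be a table of unnormalized preferences over a small discrete output set, the goodness test G is a simulated teacher, and the names `run_iml`, `respond`, `teacher`, and `step` are hypothetical.

```python
import random


def run_iml(inputs, outputs, goodness, horizon, step=0.5, seed=0):
    """Sketch of Algorithm 1 with a tabular preference model (an assumption)."""
    rng = random.Random(seed)
    # prefs[i][o]: unnormalized preference for answering input i with output o.
    prefs = {i: {o: 1.0 for o in outputs} for i in inputs}
    for _ in range(horizon):
        i_t = rng.choice(inputs)                        # obtain I_t ~ I
        weights = [prefs[i_t][o] for o in outputs]
        o_t = rng.choices(outputs, weights=weights, k=1)[0]  # O_t = M(I_t)
        if goodness(o_t, i_t) == "good":
            prefs[i_t][o_t] *= 1.0 + step               # increase probability
        else:
            prefs[i_t][o_t] *= 1.0 - step               # decrease probability
    return prefs


def respond(prefs, i):
    """Return the currently most preferred output for input i."""
    return max(prefs[i], key=prefs[i].get)


# Simulated teacher standing in for the human-defined goodness function G.
desired = {"hello": "hi there", "goodbye": "see you soon"}


def teacher(o, i):
    return "good" if desired[i] == o else "bad"


prefs = run_iml(list(desired), ["hi there", "see you soon", "what?"],
                teacher, horizon=400)
```

Here the human never specifies the mapping directly; the table drifts toward the desired responses only through repeated good/bad judgements, which is the sense in which the goodness function and the space of possible outputs constrain what can be taught.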

While we have explored the basis of IML thus far, it is important to 
rephrase an expanded motivating question: “given human inputs, what are 
desirable model outputs?” IML systems should use information available by 
leveraging both the human- and the machine-in-the-loop. The machine should 
work on what it is best at, and so should the human, and interaction bandwidth 
capacity should be used for sharing knowledge. Collaboration between multiple
humans has historically been easier than with machines due, in large part, to
shared communication.


2.1.1 Shared Communication 


For effective human-machine collaboration, both intelligent systems—human 
and machine—should have a means by which to communicate their intentions, 
actions, goals, and understanding of the other. Human teaching allows agents
to achieve arbitrary user-centric goals, improve an agent’s learning speed, and
beneficially direct agent exploration [159]. 

There exists a continuum of IML systems, from loosely coupled to tightly 
coupled. This continuum is defined by the explicitness of interaction and the 
bandwidth of shared communication [261]. Tightly coupled humans might pro- 
vide input at each step of model prediction. Loosely coupled humans might 
only provide input prior to model learning and prediction, as in supervised
learning. Optimal learning machines might have adaptive levels of autonomy.
This sliding autonomy is deployed in Autopilot on new Tesla vehicles.² In
rural highway driving the model could act more independently. In busy urban 
driving, the model requires more input from a human. 
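
A hypothetical sketch of sliding autonomy follows. It assumes the model reports a confidence score and that each context carries its own threshold; the threshold values, context names, and the function `choose_action` are invented for illustration and do not describe how any deployed driving system works.

```python
def choose_action(model_action, confidence, context, ask_human):
    """Act autonomously when confident enough for the context, else defer."""
    # Illustrative thresholds only: busier contexts demand more confidence
    # before the machine is allowed to act without human input.
    thresholds = {"rural_highway": 0.6, "busy_urban": 0.95}
    if confidence >= thresholds[context]:
        return model_action, "machine"
    return ask_human(), "human"


# The same model output is accepted in one context and deferred in another.
rural = choose_action("keep_lane", 0.8, "rural_highway", lambda: "brake")
urban = choose_action("keep_lane", 0.8, "busy_urban", lambda: "brake")
```

Raising a threshold moves the system toward the tightly coupled end of the continuum, since the human is consulted on more steps.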

There are a number of explicit communication channels the human may 
use for teaching a learning machine. Direct teaching signals take the form of 
scalar inputs like buttons, switches and/or dials [261], verbal and auditory sig- 
nals [323], tactile inputs, gestures [264], visual inputs for facial emotions [337], 
natural language [90], [165] and extend to neural interfaces capturing intention 
and signal directly from the human brain [143]. A 2015 review article discusses 
additional interactions [214]. 

Through direct ML-human interaction, the human often adapts their input 
to be more appropriate. This co-adaptation is echoed in the Joseph paradigm: 
“to teach is to learn twice over”.³ In teaching a machine, the human is learning
both how to teach the machine and how the machine learns. Extending this 
idea, we can imagine an ML system that can teach a human how effective 
they are at teaching. This transparency of learning can allow humans to 
better understand how to teach agents [325]. 


As two learning systems interact and communicate, they build a shared
language and a model of the understanding of the other. This communicative
capital is built up over many back and forth interactions [259]. Reducing the
number of interactions is a common goal of IML. Like capital investment, in the
economic sense, communicative capital can require significant upfront cost in 
terms of time and effort to establish and maintain a communication channel. 
These costs may be incurred during the normal interactions of a partnership, 
or through preliminary effort external to the task-specific goals. 

Through all of this communication, the human is attempting to share with 
the learning machine a sense of “when I do this, I want this to happen”. The 
human is attempting to define a mapping from inputs to outputs and com- 
municate this mapping with the machine. The human co-adapts by refining 
which inputs they are providing to the machine, based on what they have learned
about how the machine learns from different inputs. 

This co-adaptation may be tightly coupled (e.g. the human pushes buttons 
to program a thermostat to turn on every morning), or loosely coupled (e.g. 
the thermostat builds a model of human interaction by finding patterns in an 
abstract data stream). While these two methods of co-adaptation are similar,
they differ on the perception stream of the learning machine.


2.1.2 Shared Perception 


Perception is the process of an intelligent system becoming aware of some- 
thing through its senses. The perception stream is the flow of continuous data 
input to the sensors and processed by the computational machinery. Clas- 
sic, non-learning, machines perceive inputs from humans through specially 
programmed sets of instructions represented in an abbreviated format (e.g. 
buttons on a remote control). Learning machines have expanded perception 
and capacity to process abstract data. 

Perception data streams are quantized into streams of binary digits (i.e.
bits), ones and zeros. Data can stream into learning machines from a
wide variety of communication channels or perception sensors. Ongoing sensor 
innovation allows for arrays of sensors to be embedded in IML systems. These
sensor arrays form perceptual streams for learning machines. When adding
sensors, there is a balance to be struck. With no perception it is impossible to
communicate teaching and task signals. With too many sensors it is difficult 
to find meaningful correlations in inputs to help predict outputs. That is, it is 
hard to model patterns in data across all the sensor dimensions. The reduced 
predictive power due to a high dimensional space is sometimes referred to as 
the curse of dimensionality [19]. A narrow stream of data can only contain
a limited amount of information [304], and in a stream that is too vast,
meaningful information is difficult to find.
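
One symptom of the curse of dimensionality can be illustrated numerically: as the number of sensor dimensions grows, the nearest and farthest points from a query become almost equally distant, so distance carries less signal. The sketch below assumes uniformly random sensor readings in the unit hypercube; the function name `distance_contrast` and the sample sizes are illustrative choices.

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from a random query point.

    Values near 0 mean some points are much closer than others; values
    near 1 mean all points look roughly equally far away.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)


low = distance_contrast(2)      # two sensor dimensions
high = distance_contrast(1000)  # one thousand sensor dimensions
```

With only two dimensions the ratio is small because some readings fall very close to the query, whereas with a thousand dimensions the ratio approaches one, which is one reason patterns become harder to model across many sensors.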

From the human perspective, we imagine interacting with learning ma- 
chines in a variety of ways. We can imagine that machines can learn from 
us through indirect sensors (passively observing parts of what we do), and/or 
through direct sensors. Humans are quick to build a mental model of teaching 
and learning through both their direct and indirect actions. 

We can perceive these learning machines through our five main senses (i.e.
taste, sight, touch, smell, and hearing), through our additional sensor streams
(e.g. balance, temperature, and proprioception), and critically through our
sensing of the passage of time, familiarity, recollection, and agency. We desire
IML systems which respond immediately and in a familiar fashion, remember our
specific preferences, and allow us to take actions in the interaction.

It is with this background on communication and perception that we are able
to discuss evaluation in IML. Several exciting current research directions in
IML and additional considerations can be found in Appendix A.


2.1.3 Evaluation 


Evaluation is central to any ML system: optimization toward an
evaluation metric (i.e. a measure of goodness) is what drove the simple learning
algorithm presented previously (Algorithm 1). Evaluation is based on an objective
function, and this function is most often defined by a human. To properly
evaluate an IML system, care should be taken in defining this objective.

There is a distinction to be made between evaluating the performance of the
ML system on a task and evaluating the amount of physical effort, cognitive
effort, and input required from the human. Whereas task performance can be
measured and compared to optimal task performance, human factors will often
be user-specific and may not correspond to the learning system’s performance
on a given task. This could be due to a misunderstanding of the communication
channel by the human, to differing uses of the communication channel, or
even to the human shifting how they communicate over the course of the
task [120]. Boukhelifa et al. (2018) provide a comprehensive overview of
quantitative and qualitative evaluations of IML systems [31]. Brief details
are included below, as they deserve consideration when reading the following
chapters.


Human-Centric Evaluation 


Measuring the quality of the interaction between humans and ML systems 
often requires a human-centric design approach and solid user experience test- 
ing. Human-centred evaluations should explore the mental models of the IML 
system that the human develops. 

Questions focused on engagement, such as “how well could you concentrate
on the task rather than on the mechanisms used to perform those tasks?”, and
on perceived ability, such as “how much were you able to control events in the
task performance?”, can be asked to explore these important human-centric
opinions. These questions are explored in an IML context in Chapter 4.

The amount of training and interaction, as well as the training strategies, should be
assessed. These assessments may reveal interesting patterns in the strategies of
different groups of human users. Share-ability of a user’s trained model should
be measurable (additional discussion in Appendix A): can a user effectively
train a system and then share that training with someone else to catch that
user’s system up with their own? Finally, privacy of the user’s data should be
quantifiable and clearly communicated. Both what is being captured in the
perceptual stream and how that data is transformed, stored, transmitted, and
shared need to be explicitly stated on any deployed IML system, similar to a
website’s privacy policy [7].

Machine-Centric Evaluation


Evaluating from the machine side is more straightforward, but there are several
important measurable features which are specific to IML. The balance between
the number of actions taken by the machine for each action by the human, for
instance, correlates with the amount of agency each has in task performance.
The machine must learn to find an objective balance between appeasing the
goal of the human and maximizing task performance. Classic ML evaluations
should be designed, executed, and reported for IML experimentation. Measures
of algorithm robustness, convergence, diversity, and task
performance (e.g. number of training steps to asymptotic performance) are as
vital in IML as they are in classic ML problems.

Finally, one interesting evaluation metric is the correlation between user 
evaluation metrics and predictions of those values. Given that the IML system 
should be modelling the human user, and have some features of the human user 
in the perception stream, systems should make predictions about how a user 
would evaluate a system. Learned models of human evaluation are presented
in Section 2.2 as they relate to automated dialogue systems evaluation.
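
As a minimal sketch of the correlation-based metric described above, the following Python snippet computes the Pearson correlation between user evaluations and a system's predictions of those evaluations. The function name `pearson_correlation` and the rating values are hypothetical and purely illustrative; they are not taken from any experiment in this thesis, and Pearson correlation is only one of several correlation measures that could be used.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings on a 1-5 scale (illustrative values only).
user_ratings = [4, 2, 5, 3, 1]
predicted_ratings = [3.5, 2.5, 4.5, 3.0, 1.5]

r = pearson_correlation(user_ratings, predicted_ratings)
print(f"correlation between user and predicted ratings: {r:.3f}")
```

A value of r near 1 would suggest that the system's model of the user tracks the user's own evaluations closely.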


2.1.4 IML in Summary 


Learning machines are gaining popularity with advances in communication 
and machine perception. These machines use IML which combines the skills 
and abilities of humans and machines to produce new capabilities. Interactive 
learning algorithms should be applied with an understanding that different 
humans interact with these systems differently; individual-specific dynamics 
should be explicitly modelled. Through open-source code and open data, sci- 
entists, creative technologists, and industry developers can quickly iterate by 
sharing their ideas and developments around the globe. We will soon develop, 
deploy, and disseminate IML systems with similar ease. 

This section has provided background on the field of IML. Collaboration
toward optimal task performance can require tightly coupled interaction. Insufficient
communication between systems can be due to limits in information capacity
or transmission rate and/or misaligned models of perception. The next section
concerns itself with one of the most information-dense and rapid transmission
modes of interaction which humans employ: communication through conversational
dialogue.


2.2 Conversational Dialogue Systems 


The ability of a machine to generate meaningful open conversational dialogue
which cannot be distinguished from that of a human participant has been one
proposed test of a successful AI [330]. The imitation game was proposed by Alan
Turing (1950) and is thus often referred to as the Turing test [330]. Passing
the Turing test refers to the ability of a computer program to impersonate a
human in a real-time conversational dialogue, using written communication.
Impersonation is sufficient when a human judge is unable to distinguish, on
the basis of the conversation alone, between the program and a human.

The term dialogue comes from the Greek word dialogos; its roots are dia
(through) and logos (speech or reason) [141]. In relation to the work of this thesis,
we define dialogue as text-based, English language conversation between two 
participants as a means to share information. This conversation need not occur 
via speaking and listening, nor need it be grounded in an English language- 
based textual representation [42], [43]. The simplified and constrained domain 
permits focus on improvements to specific sub-components of conversation. 
Conversations can be categorized, based on participant goals, into one of four 
types: debate, dialogue, discourse, and diatribe.* Dialogue is separated from 
the other three as it is a co-operative, two-way conversation where the goal 
is to exchange information and build relationships with one another. Open 
dialogue is the sharing of opinions on subjects and topics that are brought up 
and/or thought of during conversation. 

Early work on text-based dialogue has expanded to include spoken dialogue
on personal mobile devices due, in large part, to advances in automated
speech recognition and automatic speech generation. These advances include
increases in speed and reductions in error rates [332]. These improvements are
due in large part to replacing the conventional acoustic-feature modelling with
supervised deep learning models [116]. These deep learning models are large
networks of many parameters which define the mapping function from input
to output. With enough data, these models can learn underlying patterns and
correlations in high-dimensional data and generalize well to unseen input data.

Figure 2.1: A general dialogue system diagram. The entire dialogue loop contains
(from top-left, counter-clockwise) a human interacting with the system, a
microphone voice interface and speech recognition to convert audio waveforms
to digital signals, natural language understanding to parse the recognized input,
a dialogue manager optionally connected to external knowledge bases, and
natural language generation and speech synthesis to generate appropriate output
audio from the generated response. The audio is output using speakers and
the robot embodies response dialogue through emotive actions and movement.

A dialogue system is composed of three major components (as illustrated
in Figure 2.1): 1) an input interface, 2) an output interface, and 3) a processing
module. The processing module in a dialogue system is often called the
dialogue manager, and it processes the input text, models the conversation as
a whole, and generates a text response based on the input.

Machines capable of conversation, sometimes referred to as conversational
agents, are of growing importance in IML systems. These agents enable humans
to use natural, smooth, efficient communication to interact with their
devices [233], [242]. These systems allow ML systems to collaborate and interact
with humans to perform simple tasks (e.g. booking flights or playing
music), enable adaptive user-directed tutoring systems [187], and provide social
assistance through machine-assisted therapy, including supporting cognitive-behavioural
therapy for treating depression [85], helping patients with chronic
pain [230], and supporting reminiscence [245].

Several advances in computational linguistics (i.e. natural language processing)
have led to the growth in popularity of these systems. These advances
include: an increase in computing power, increased availability of
large amounts of clean training data [298], development of successful ML algorithms
[30], [297], and a richer understanding of the structure of human
language [117]. There are major challenges in designing, building, and deploying
conversational agents [233]. Problems of robustness, scalability, data
biases, domain adaptation, engineering infrastructure, privacy, and stability
continue to limit the widespread adoption of these systems.

The focus of this thesis is on improvised theatre; thus, the focus of this
background material is on conversational modelling for generating interesting
and engaging responses to inputs over the course of a conversational dialogue.


2.2.1 Conversational Modelling 


The basic problem of conversational modelling involves taking a natural language
input from a human and outputting an appropriate natural language
response. As the system is engaged in a conversation, it takes in
human inputs and outputs responses in alternating sequence. This assumption
imposes a turn-taking model of conversation on the system, a common
feature in open conversational dialogue.

A chatbot (also known as a chatterbot, bot, messenger bot, interactive
agent, conversational dialogue agent, or artificial conversation entity) is a computer
program which engages in turn-taking conversational dialogue. Chatbots
can be differentiated based on their capabilities and/or method for conversational
modelling. For instance, taxonomies include open- vs. closed-domain
dialogue systems (i.e. differentiating the space of knowledge of the system) and
rule-based vs. corpus-based dialogue systems (i.e. differentiating based on the
underlying model for conversational modelling). Inputs are often text or auditory
signals, but may also include visual information and/or side-context relevant
to the current conversation. Outputs are often text or auditory signals,
but may also include multi-modal generation and embodiment actuation. For
instance, a dialogue system could actuate a physical embodiment or modify
the state of the environment alongside the generated dialogue.

Appendix B contains a brief history of chatbots by detailing the evolution 
from rule-based, to corpus-based, to unified conversational models. It also 
covers dialogue system competitions which have driven innovation and the 
rise of social and corporate bots. 

Rule-based chatbots have a conversational model based on hard-coded
rules. These hand-crafted systems are often heavily engineered to match a
set of pre-defined inputs to pre-defined outputs. Corpus-based chatbots use a
large corpus of data to generate outputs in response to inputs. These corpora
are often too large to fit in the memory of a computer. Corpus-based bots
are further subdivided into two categories: information retrieval (IR) methods
and generative language modelling (LM) methods. IR methods take as
input a query sentence and then look for the best responses in a corpus of
possible dialogue outputs. LM methods, sometimes referred to as statistical
LMs or generative LMs, aim to model the statistics of the language used in a
corpus and then generate responses to a given input based on the underlying
probabilities.


2.2.2 Information Retrieval for Chatbots 


Given a large set of possible responses (i.e. textual sequences of words), a
conversation model could select a likely response from the set. The appropriateness
of the selected response given an input is based on a human-defined
objective function. This function is often based on a distance metric, measured
by modelling input-response pairs of the corpus. A variety of metrics
have been used including linguistic similarity based on overlap of words, characters,
and parts of speech [339]. Another metric used is distance of nearest
neighbours in a distributed semantic vector space [16].

As a representative example, Banchs et al. (2012) suggested the use of
movie scripts to build chat-oriented dialogue systems [16], [297]. By conditioning
on one or more utterances, their model searched a database of movie
scripts and retrieved an appropriate response. Ameixa et al. (2014) showed
that movie subtitles could be used to provide appropriate responses to out-of-domain
questions using IR [5]. Another approach, using unstructured documents
for chatbot engines, is presented in Yan et al. (2016) [357].
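
To make the retrieval idea concrete, the following is a minimal Python sketch of IR-based response selection. It is not the method of any of the cited works: it assumes a tiny, hypothetical list of input-response pairs and uses cosine similarity over bag-of-words counts as the (human-defined) objective. The names `cosine_similarity`, `retrieve_response`, and `pairs` are illustrative assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_response(query, pairs):
    """Return the stored response whose paired input is most similar to the query."""
    query_bag = Counter(query.lower().split())
    _, best_response = max(
        pairs,
        key=lambda pair: cosine_similarity(query_bag, Counter(pair[0].lower().split())),
    )
    return best_response


# Hypothetical input-response pairs standing in for a large corpus.
pairs = [
    ("where do you live", "now i live in cortona italy"),
    ("what is your name", "my name is not important"),
    ("do you like theatre", "i love improvised theatre"),
]

response = retrieve_response("where do you live now", pairs)
print(response)
```

Because the response is always drawn from the stored pairs, this approach can never produce an utterance that is absent from the corpus, which motivates the generative models of the next section.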


2.2.3 Language Modelling 


Information retrieval methods are limited to a predefined corpus of responses. 
We may desire to generate new responses. This can be done using a statistical 
language model (LM) which is a probability distribution over a sequence of 
words. Using this LM, generating new responses corresponds to sampling from 
a probability distribution. The LM can also be used to compare the relative 
likelihoods of different sentences. 

To understand much of the following material on LMs, several definitions
and a brief introduction to probability theory are required. Language modelling
is best communicated with a working example sentence. Consider the
following sequence of words:

‘now i live in cortona italy’


Note that capitalization and punctuation are excluded from this example for 
clarity. This sentence is composed of six unique sub-elements or words. There 
is an order to the words, and the words in another different order may, or may 
not, convey the same meaning (e.g. consider the order “in cortona italy i now 
live” and the ambiguous mis-ordering “i cortona live now in italy”). 

If each of the words has an associated probability of occurring, $p(w_m)$, then
the probability of the sentence, $p(\text{sentence})$, can be expressed as a ‘chain’ of
probabilities, $p(w_1, \ldots, w_m)$, given a sequence of length $m$. For our example
sentence, this corresponds to calculating the probability of the sequence of 6
words, $p(\text{sentence}) = p(w_1, w_2, w_3, w_4, w_5, w_6)$. An LM can tell us the probability
of this sentence occurring and how it compares to the probability of
other sentences occurring. For generation, each next word is sampled from the
LM based on probabilities of words which tend to follow the current word and
context in a given structured collection of sentences (i.e. corpus).

⁶ Capitalization and punctuation can be added and removed with invertible rule-based
functions or learned with predictive models.

We will start with a naive LM similar to rolling a fair die. Imagine on
each face of a fair six-sided die is written one of the words from our working
example sentence. Each of the words has an equal chance of occurring
(i.e. independent, uniform probability). The likelihood of generating our example
would be calculated as the product of the independent probabilities:
$p(\text{sentence}) = \prod_{i=1}^{6} p(w_i) = (1/6)^6 = 1/46656 \approx 0.000021$.

There are 46,656 possible six-word sequences⁷ over the 6 words in our vocabulary.
Generating the example sentence we desire is quite unlikely. Also, it is just
as unlikely as generating our mis-ordered sentences.

From this naive model, one improvement might be to note that the words
‘now, i, in’ might occur more often in a subset of English text than the words
‘live, italy, cortona’. So, we might weight the probabilities of each word based
on the relative likelihoods of each word occurring in a corpus. That is, we
can calculate the ratio between how often a given word occurs and the total
number of words in the corpus. Consider a hand-crafted, 3 sentence corpus:

‘now i live in italy’
‘specifically in italy now i live in cortona’
‘cortona italy’


It is now clear that the relative likelihood of each word occurring in the corpus
is not uniform, as it was in the first example. In this corpus, there are 15 words, 7
of which are unique. Table 2.1 tracks the number of occurrences and relative
likelihood of each unique word in the corpus.
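
The occurrence counts and relative likelihoods of Table 2.1 can be reproduced with a few lines of Python. This is a minimal sketch: the variable names (`corpus`, `counts`, `unigram`) are illustrative choices, and whitespace tokenization is assumed because the example corpus has no punctuation or capitalization.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "now i live in italy",
    "specifically in italy now i live in cortona",
    "cortona italy",
]

# Count every word occurrence across the three sentences.
counts = Counter(word for sentence in corpus for word in sentence.split())
total_words = sum(counts.values())

# Relative likelihood of each word: occurrences / total words in the corpus.
# Fraction reduces automatically (e.g. 3/15 is stored as 1/5).
unigram = {word: Fraction(count, total_words) for word, count in counts.items()}

for word, count in counts.items():
    print(f"{word:<14}{count:<4}{count}/{total_words}")
```

Exact fractions are used instead of floating point numbers so that the values can be compared directly against the table.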

Word            Occurrences    Relative Likelihood
now             2              2/15
i               2              2/15
live            2              2/15
in              3              3/15
italy           3              3/15
specifically    1              1/15
cortona         2              2/15

Table 2.1: Unigram language modelling table of occurrences and relative likelihoods
in the 3 sentence corpus.

The data in Table 2.1 defines our unigram model. This model can be used
to compute our desired probability $p(\text{sentence})$ by considering the probability
of a term occurring given a context. Before continuing, it is helpful to reiterate
Bayes’ theorem and the chain rule of conditional probabilities.

⁷ Recall that order matters.

Consider random words sampled from a vocabulary, $w \sim V$. Then, the
probabilities of observing $w_1$ and $w_2$ are $p(w_1)$ and $p(w_2)$, respectively. But
the likelihood of observing $w_2$ might be different if we have observed $w_1$; the
probability might be conditional on earlier observations.

Bayes’ theorem states that the posterior probability is equal to the prior
probability multiplied by the likelihood, divided by the evidence. The posterior
probability is the conditional probability of $w_2$ occurring given $w_1$. Often
the evidence (i.e. the marginal probability of $w_1$) must be collected
over all possible $w \in V$. Formally, Bayes’ theorem states:

\begin{align}
p(w_2 \mid w_1) &= \frac{p(w_2) \cdot p(w_1 \mid w_2)}{p(w_1)} \tag{2.1} \\
&= \frac{p(w_2) \cdot p(w_1 \mid w_2)}{\sum_{w \in V} p(w_1 \mid w) \cdot p(w)} \tag{2.2}
\end{align}





The chain rule of conditional probabilities states that any joint probability
over many random variables may be decomposed into conditional distributions
over only one variable:

$$p(w_i, w_j) = p(w_i \mid w_j) \cdot p(w_j) = p(w_j \mid w_i) \cdot p(w_i) \tag{2.3}$$

and, over multiple variables as:

$$p(w_1, \ldots, w_m) = \prod_{i=1}^{m} p(w_i \mid w_1, \ldots, w_{i-1}) \tag{2.4}$$


Thus, decomposition over three random words $w_1, w_2, w_3$ can be performed
as follows:

\begin{align}
p(w_1, w_2, w_3) &= p(w_3 \mid w_2, w_1) \cdot p(w_2, w_1) \tag{2.5} \\
&= p(w_3 \mid w_2, w_1) \cdot p(w_2 \mid w_1) \cdot p(w_1) \tag{2.6}
\end{align}


We might assume a “naive” conditional independence for the current unigram
model. That is, we assume that each word $w_i$ is conditionally independent
of each other word $w_j$ for $j \neq i$. Conditional independence implies that
knowledge of one word occurring provides no information about the likelihood
of another word occurring. We can thus calculate our desired probability as:

\begin{align}
p(\text{sentence}) &= p(w_1, w_2, w_3, w_4, w_5, w_6) \tag{2.7} \\
&= p(w_1) \cdot p(w_2 \mid w_1) \cdot p(w_3 \mid w_2, w_1) \cdot p(w_4 \mid w_3, w_2, w_1) \cdots \tag{2.8} \\
&= p(w_1) \cdot p(w_2) \cdot p(w_3) \cdot p(w_4) \cdot p(w_5) \cdot p(w_6) \tag{2.9}
\end{align}


The model has a probability distribution over the vocabulary $V$. The
probabilities in Table 2.1 sum to 1. We indicate the unigram model in
the subscript of the probability notation. Then, the probability of a target
sentence under the unigram LM is calculated as:

$$p_{\text{unigram}}(\text{sentence}) = \prod_{w \in \text{sentence}} p_{\text{unigram}}(w) \tag{2.10}$$

The likelihood of our example sentence ‘now i live in cortona italy’, given
our source corpus, can be calculated using likelihoods from Table 2.1 as
$p_{\text{unigram}} = (2/15) \cdot (2/15) \cdot (2/15) \cdot (3/15) \cdot (2/15) \cdot (3/15) \approx 0.000013$. We
can compare this to the naive uniform distribution model using the expanded
7-word vocabulary of the 3 sentence corpus, $p_{\text{uniform}} = (1/7)^6 \approx 0.0000085$.
This difference, $p_{\text{unigram}} > p_{\text{uniform}}$, represents a higher likelihood of this new
LM generating sentences with a similar distribution to the corpus. The unigram
model is a relatively minor and rather obvious improvement from the
uniform LM.
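
The comparison between the unigram and uniform models can be checked numerically. The sketch below is one possible implementation of Equation 2.10 on the 3 sentence corpus; the function names `unigram_probability` and `uniform_probability` are illustrative assumptions, and unseen words are simply given probability 0 here (no smoothing).

```python
from collections import Counter
from fractions import Fraction
from math import prod

corpus = [
    "now i live in italy",
    "specifically in italy now i live in cortona",
    "cortona italy",
]
counts = Counter(word for sentence in corpus for word in sentence.split())
total = sum(counts.values())  # 15 word occurrences in the corpus


def unigram_probability(sentence):
    """Product of relative word likelihoods (Equation 2.10); 0 for unseen words."""
    return prod(Fraction(counts[w], total) for w in sentence.split())


def uniform_probability(sentence):
    """Every word in the 7-word vocabulary is equally likely."""
    vocabulary_size = len(counts)
    return Fraction(1, vocabulary_size) ** len(sentence.split())


target = "now i live in cortona italy"
p_unigram = unigram_probability(target)
p_uniform = uniform_probability(target)

print(f"unigram: {float(p_unigram):.7f}")
print(f"uniform: {float(p_uniform):.7f}")
```

Note that the unigram probability is unchanged under any re-ordering of the words, since multiplication is commutative; this is the lack of ordering knowledge that n-gram models address.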

In the unigram model, the probability of each word only depends on that
word’s probability of occurring in the corpus, as opposed to a uniform probability.
Given Equation 2.10, an LM with a large vocabulary size, $|V|$, will
assign a low probability to most sentences. Minor differences in probability, due to relative
frequency, can make significant differences in language modelling.

How might our unigram LM assign a probability to the sentence ‘i live in 
edmonton’? As the word ‘edmonton’ was never seen in the input corpus, what 
probability should the model assign? Just because the input never occurred in 
the training data, that does not mean it will not occur in the testing data. Sev- 
eral solutions have been proposed to handle words that are out-of-vocabulary. 
One common solution is to use smoothing by assigning a low probability to 
every out-of-vocabulary word [51], [198]. Another solution is to limit the vo- 
cabulary of the model by replacing each out-of-vocabulary word with a special 
token ‘OOV’ [354]. Then, as the model may see multiple ‘OOV’ tokens, it is 
possible to estimate the probability of sentences involving out-of-vocabulary 
words. Alternative methods include using characters, sub-word units, or byte- 
pair encoding methods to improve coverage by modelling the corpus at a more 
granular level [296]. 
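
As one illustration of the out-of-vocabulary strategies above, the following sketch combines a special ‘OOV’ token with add-k (Laplace) smoothing. This is only one of many possible smoothing schemes and is not claimed to be the method of the cited works; the function name `smoothed_unigram` and the choice k = 1 are assumptions made for the example.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "now i live in italy",
    "specifically in italy now i live in cortona",
    "cortona italy",
]
counts = Counter(word for sentence in corpus for word in sentence.split())

OOV = "OOV"  # special token standing in for every unseen word


def smoothed_unigram(word, k=1):
    """Add-k smoothed relative likelihood over the vocabulary plus the OOV token."""
    token = word if word in counts else OOV
    vocabulary_size = len(counts) + 1  # 7 seen words + the OOV token
    total = sum(counts.values())       # 15 word occurrences
    # Counter returns 0 for the unseen OOV token, so it receives k / denominator.
    return Fraction(counts[token] + k, total + k * vocabulary_size)


print(smoothed_unigram("edmonton"))  # unseen word, mapped to OOV
print(smoothed_unigram("in"))        # seen word, slightly discounted
```

With this scheme the sentence ‘i live in edmonton’ receives a small but non-zero probability, and the smoothed probabilities over the 7 seen words and the ‘OOV’ token still sum to 1.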

We can extend the unigram model to an n-gram model by considering the
preceding $n - 1$ words of context in our probability calculation. The concept is to
model the structure of a corpus as the probability of words occurring alone and
with context. The simplest of the n-gram models, unigram or $n = 1$, considers
words in isolation (i.e. with no consideration of word order). Bigrams ($n = 2$),
and higher order n-gram models ($n = 3, \ldots$), consider the probability of a word
occurring in the context of the preceding $n - 1$ words.

N-gram models require 2 additional important tokens, ‘START’ and ‘END’,
at the start and end of the corpus sentences, respectively. These tokens help
to model the fact that some words are more likely to occur at the start or end
of sentences. We modify the corpus by adding these tokens:

‘START now i live in italy END’
‘START specifically in italy now i live in cortona END’
‘START cortona italy END’


We also add these tokens to the working example target sentence which
becomes: ‘START now i live in cortona italy END’. The joint probability of
the sequence of words can be expressed using conditional probability:

\begin{align}
& p(\text{‘START now i live in cortona italy END’}) \tag{2.11} \\
&= p(\text{‘now’} \mid \text{‘START’}) \times p(\text{‘i’} \mid \text{‘now’}) \times \cdots \tag{2.12} \\
&\quad \times p(\text{‘END’} \mid \text{‘italy’}) \tag{2.13}
\end{align}


The conditional probability is calculated by dividing the number of occurrences
of the sequence $|w_{i-1}\, w_i|$ by the total number of occurrences of the
context word $|w_{i-1}|$. Absolute value notation is used to denote the number of
occurrences of a word or word sequence in the source corpus:

$$p(w_i \mid w_{i-1}) = \frac{|w_{i-1}\, w_i|}{|w_{i-1}|} \tag{2.14}$$


This calculates the relative likelihood of a pair of words, and maintains a
measure of likelihood of two words occurring in a specific order. The model
addresses two important issues. First, the unigram model contained no knowledge
of ordering; with our new model, the probability of the word pair ‘now
i’ is higher than the probability of the word pair ‘i now’. This appropriately
reflects the corpus statistics. Second, the unigram model may generate repetitions
of high-probability words (e.g. ‘now now now ...’); our improved model
accounts for this by assigning these repetitions a low probability.

The probability of the first word is conditional on being preceded by the
‘START’ token, $p(w_1 \mid \text{‘START’})$, and similarly for the last word, $p(\text{‘END’} \mid w_m)$,
for a sentence with $m$ words. Table 2.2 presents word pair occurrence counts;
note that only bigrams from the corpus are included. All other bigrams are of
frequency 0, thus a smoothed relative likelihood near 0 might be assigned. Note
that given a 3 sentence corpus, there are 3 occurrences of both the ‘START’ and
‘END’ tokens. The rest of the necessary occurrences for probability calculation
are in Table 2.1.

Bigram                Occurrences
START now             1
now i                 2
i live                2
live in               2
in italy              2
italy END             2
START specifically    1
specifically in       1
italy now             1
in cortona            1
cortona END           1
START cortona         1
cortona italy         1

Table 2.2: Bigram language modelling table of occurrences and relative likelihoods
of the modified 3 sentence corpus.

It is now straightforward to calculate the probability of the
example sentence under the bigram model:


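\begin{align*}
& p(\text{‘START now i live in cortona italy END’}) \\
&= p(\text{‘now’} \mid \text{‘START’}) \cdot p(\text{‘i’} \mid \text{‘now’}) \cdot p(\text{‘live’} \mid \text{‘i’}) \cdot p(\text{‘in’} \mid \text{‘live’}) \\
&\quad \cdot p(\text{‘cortona’} \mid \text{‘in’}) \cdot p(\text{‘italy’} \mid \text{‘cortona’}) \cdot p(\text{‘END’} \mid \text{‘italy’}) \\
&= \frac{|\text{‘START now’}|}{|\text{‘START’}|} \cdot \frac{|\text{‘now i’}|}{|\text{‘now’}|} \cdots \frac{|\text{‘italy END’}|}{|\text{‘italy’}|} && \text{(Eq. 2.14)} \\
&= (1/3) \cdot (2/2) \cdot (2/2) \cdot (2/2) \cdot (1/3) \cdot (1/2) \cdot (2/3) \\
&= 1/27 \approx 0.037
\end{align*}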
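
The bigram calculation can be verified with a short Python sketch that applies Equation 2.14 to the modified corpus. The names `bigram_counts`, `context_counts`, and `bigram_probability` are illustrative, and no smoothing is applied, so any sentence containing an unseen bigram receives probability 0.

```python
from collections import Counter
from fractions import Fraction
from math import prod

corpus = [
    "now i live in italy",
    "specifically in italy now i live in cortona",
    "cortona italy",
]
# Add the START and END tokens to every sentence.
sentences = [["START", *s.split(), "END"] for s in corpus]

# Occurrences of each context word (every token except the final END).
context_counts = Counter(w for s in sentences for w in s[:-1])
# Occurrences of each ordered word pair.
bigram_counts = Counter(pair for s in sentences for pair in zip(s, s[1:]))


def bigram_probability(sentence):
    """Probability of a sentence under the unsmoothed bigram model (Eq. 2.14)."""
    tokens = ["START", *sentence.split(), "END"]
    factors = []
    for prev, word in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return Fraction(0)  # context word never seen in the corpus
        factors.append(Fraction(bigram_counts[(prev, word)], context_counts[prev]))
    return prod(factors)


p = bigram_probability("now i live in cortona italy")
print(p, float(p))
```

The same function assigns probability 0 to ‘cortona in italy’, since the bigram ‘cortona in’ never occurs in the corpus; this illustrates why smoothing values matter for the bigram model.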





There are still limitations to the bigram model. Depending on smoothing values,
simple and correct sentences may have very low probability (e.g. ‘cortona
in italy’). Also, there is a non-zero probability that, when sampling from the
model for sentence generation, the sentence never reaches the ‘END’ token (e.g.
‘now i live in italy now i live in italy now ...’). Truncation at a certain length
can help address the second problem, and comparisons between smoothing
techniques are discussed in past research [51].

With these examples we have
covered the core mathematical concepts for n-gram LMs. The probability of
observing $w_i$ in context can be approximated by the probability of observing it
in the shortened context history of the preceding $n - 1$ words. The probability
$p(\text{sentence})$ of observing the sentence $w_1, \ldots, w_m$ is approximated as:


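\begin{align*}
p(\text{sentence}) &= p(w_1, \ldots, w_m) \\
&= \prod_{i=1}^{m} p(w_i \mid w_1, \ldots, w_{i-1}) \\
&\approx \prod_{i=1}^{m} p(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})
\end{align*}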

The conditional probability of each word is then calculated using n-gram
occurrences and frequency counts:

$$p(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{|w_{i-(n-1)} \cdots w_{i-1}\, w_i|}{|w_{i-(n-1)} \cdots w_{i-1}|}$$
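
Generation from such a model, including the truncation used to guard against sentences that never reach the ‘END’ token, can be sketched as follows for the bigram case ($n = 2$). The names `successors` and `sample_sentence`, the default limit of 20 words, and the fixed random seed are assumptions made for this illustration.

```python
import random
from collections import Counter, defaultdict

corpus = [
    "now i live in italy",
    "specifically in italy now i live in cortona",
    "cortona italy",
]

# successors[prev][word] = number of times `word` follows `prev` in the corpus.
successors = defaultdict(Counter)
for s in corpus:
    tokens = ["START", *s.split(), "END"]
    for prev, word in zip(tokens, tokens[1:]):
        successors[prev][word] += 1


def sample_sentence(max_words=20, rng=random):
    """Sample word by word from the bigram model, truncating at max_words."""
    words = []
    current = "START"
    while len(words) < max_words:
        candidates = successors[current]
        # Sample the next token in proportion to its bigram count.
        current = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if current == "END":
            break
        words.append(current)
    return " ".join(words)


rng = random.Random(0)  # fixed seed so that runs are repeatable
samples = [sample_sentence(max_words=20, rng=rng) for _ in range(5)]
for sample in samples:
    print(sample)
```

Every consecutive pair of words in a sampled sentence is a bigram observed in the corpus, so samples are locally plausible even when the sentence as a whole was never seen.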


It may be helpful to contextualize this model by comparison with other sequence
memory techniques in ML. Specifically, what is the relationship between
Hidden Markov Models (HMMs) and n-gram models? The n-gram
model is a probabilistic LM for predicting the next item in a sequence in
the form of an $(n - 1)$-order Markov model [268]. That is, independence assumptions
are made so that each word depends on only the last $(n - 1)$ words.
This is a Markov model approximation of the true underlying language. Thus,
an n-gram model calculates $p(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$. In an HMM, the states of
the Markov chain are words and transition probabilities govern the transitions
from word to word. These probabilities are derived based on relative frequencies
of n-grams in the corpus. Only the outcome of the transition is output,
and thus the state is said to be hidden from the observer.

Given a suitable source corpus, these models can perform well for some applications.
HMMs are used in performance⁸ and narrative script modelling,
due to their speed of training and consistent, appreciable results [247].

This section covered how to build a conditional probabilistic LM given a
source corpus. This LM can be used to generate sentences and calculate the
probability of a given test sentence under the LM. As n-grams and HMMs
have limited contextual history with which to generate responses, researchers
have turned to conversational modelling techniques with increased capacity.

⁸ http://botnik.org


2.2.4 Sequence-to-Sequence Modelling 


Rule-based methods can provide adequate responses to a constrained set of 
inputs, but the output space is limited. Neural models can generate new
and interesting responses, with a controllable amount of randomness, by
modelling sequences of words in a large input corpus.

Text-based conversational response generation and translation both involve 
the transduction of information in sequences of sentences. They both involve 
understanding the meaning and context of a sequence of words and the generation
of a meaningful response. In dialogue, the response forwards the conversation;
in translation, the response contains the same information in a new
language. Previous work has drawn this analogy in greater detail and used
machine translation techniques for dialogue response generation [172], [280]. 

Ritter et al. (2011) defined the problem of generating locally coherent, pur- 
poseful dialogue as generating an appropriate response for any conversational 
input [280]. The authors describe how generating appropriate conversational 
responses is more difficult than translation between languages due to two main 
factors: 1) the large range of plausible responses, and 2) the potential lack of 
semantic alignment between input and output pairs [188]. Given that translation
models attempt to capture input sequence information in the output,
they can suffer from the echoing problem [82]. The authors took two measures to
prevent the system from paraphrasing the input sequence in the output sequence.
They discouraged lexical similarity by removing substring phrase-pairs from the
training data, and penalized their model by the Jaccard similarity of the input and
output strings. Jaccard similarity is a rapidly calculable measure of similarity
defined as the size of the intersection divided by the size of the union of two
sample sets [136].
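
The Jaccard similarity defined above is simple to compute. The sketch below treats each string as a set of whitespace-separated words, which is an assumption of this example: the exact tokenization and the way the penalty enters the model of Ritter et al. are not reproduced here. The function name `jaccard_similarity` is illustrative.

```python
def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of the two word sets."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty strings are identical
    return len(set_a & set_b) / len(set_a | set_b)


# An echoed response scores 1.0, a partially overlapping response scores lower.
echo = jaccard_similarity("now i live in italy", "now i live in italy")
partial = jaccard_similarity("now i live in italy", "i live in cortona")
print(echo, partial)
```

A response that simply echoes the input scores 1.0, so penalizing high similarity discourages paraphrases of the input.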

Backpropagation in Feed-forward and Recurrent Neural Networks

A feed-forward neural network is a function mapping from inputs to outputs.
It is composed of computational units called artificial neurons (or nodes). It
may be single-layered or multi-layered, depending on the composition of the
computational units. Each node in a network has associated weights and a nonlinear
activation function. This nonlinearity allows these networks to distinguish
data that is not linearly separable. A perceptron is an artificial neuron
which uses the Heaviside step function as the nonlinear activation function.
Multi-layered perceptrons (MLPs) consist of at least three layers of nodes:
input, hidden, and output. MLPs use a supervised learning technique called
backpropagation for training [282].
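
A single perceptron, as described above, can be sketched in a few lines. The weights and bias below are hand-picked for illustration (they implement a logical AND) and are not learned; the convention that the step function returns 1 at exactly zero is also an assumption of this example.

```python
def heaviside(z):
    """Heaviside step function (returning 1 at z == 0 by convention here)."""
    return 1 if z >= 0 else 0


def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the step function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return heaviside(z)


# Hand-picked (not learned) parameters implementing logical AND.
and_weights, and_bias = [1.0, 1.0], -1.5
outputs = [perceptron([a, b], and_weights, and_bias) for a in (0, 1) for b in (0, 1)]
print(outputs)
```

A single perceptron of this form can only separate its inputs with a line (or hyperplane), which is why layers of nodes are needed for data that is not linearly separable.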

Backpropagation is used by the gradient descent optimization algorithm
to adjust the weights of neurons by calculating the gradient of the loss
function. Gradient descent is a first-order optimization algorithm for finding the
minimum of a function. Backpropagation, or the backwards propagation of
errors, is a method to calculate the gradient needed to update the weights (or
parameters) associated with the nodes in a neural network.
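The gradient descent update can be illustrated on a one-parameter problem. This is only a sketch: the quadratic loss, the learning rate, and the names `gradient_descent` and `loss_gradient` are assumptions chosen for illustration. In a neural network the gradient would be supplied by backpropagation rather than written by hand.

```python
def gradient_descent(gradient, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a minimum."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


def loss_gradient(w):
    """Derivative of the illustrative loss (w - 3)^2, minimized at w = 3."""
    return 2.0 * (w - 3.0)


w_min = gradient_descent(loss_gradient, w0=0.0)
```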

Feed-forward networks have connections between the nodes which do not
form cycles, or loops, in the network [292]. Information flows in one direction:
from input nodes, through the hidden layer nodes, to the output nodes. This
property differentiates a feed-forward network from a recurrent neural network
(RNN), which is a sequence-compatible generalization of a feed-forward network
where connections between nodes form a directed graph along a sequence. At a
high level, RNNs have an internal state, or memory, which allows information to
persist. An RNN can be thought of as multiple copies of the same network, each
passing a message to a successor network. It may help to imagine unrolling
an RNN to reveal a chain-like sequence of feed-forward networks.

With this background, a mathematical description of an RNN is now provided.
An RNN computes the conditional probability of an output sequence
of length $n$ given an input sequence of length $m$, or
$p(y_1, \dots, y_n \mid x_1, \dots, x_m)$.
The affine transformation is defined by the multiplication of a linear
transformation, weight matrix $W$, in the input space and subsequent addition of a
bias vector, $b$, in the output space: $(W \cdot x + b)$. In the equations below, bias
vectors for the hidden and output layers, $b_h$ and $b_y$ respectively, are excluded
for simplicity.


Let subscripts denote the time step (or sequence index) and superscripts
denote the layer in the network. Assume states are $n$-dimensional, thus
$h_t^l \in \mathbb{R}^n$ is the hidden state in layer $l$ at step $t$. Then, given a
source sentence $S$ of length $m$, represented by the sequence of input vectors
$S = (x_1, \dots, x_m)$, a single hidden layer RNN computes a sequence of hidden
states $(h_1^1, \dots, h_n^1)$ and a target sequence of output vectors
$T = (y_1, \dots, y_n)$ of length $n$ by iterating over the following equations
for $t = 1 \dots n$:


$$h_t^1 = \sigma(W^{hx} x_t + W^{hh} h_{t-1}^1) \tag{2.15}$$
$$y_t = W^{yh} h_t^1 \tag{2.16}$$


In these equations, $h_t^1$ is the hidden state vector at time step $t$, and the
activations $h_t^1$ predict $y_t$. $W^{hx}$ is the input-to-hidden weight matrix,
$W^{hh}$ is the hidden-to-hidden weight matrix, and $W^{yh}$ is the hidden-to-output
weight matrix. $h_{t-1}^1$ is initialized at time $t = 1$ with an initialization vector [317].
Note the use of the sigmoid activation function, $\sigma(\cdot)$. This non-linear
activation function is a bounded, differentiable, real function defined for all real
input values and has a non-negative derivative at every point. Other activation
functions may be used (the hyperbolic tangent, tanh, or the rectified linear
unit, ReLU, for instance) and may provide training speed and performance
improvements [240].
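To make the recurrence concrete, the following is a minimal sketch of Eq. 2.15 and Eq. 2.16 in plain Python, with biases omitted as in the text. The helper names (`sigmoid`, `matvec`, `rnn_forward`), the toy dimensions, and the weight values are assumptions for illustration only, and the sketch emits one output per input vector.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, W_hx, W_hh, W_yh, h_init):
    """Iterate Eq. 2.15 and Eq. 2.16 over a sequence of input vectors."""
    h = h_init
    hidden_states, outputs = [], []
    for x_t in inputs:
        pre_activation = [a + b for a, b in zip(matvec(W_hx, x_t), matvec(W_hh, h))]
        h = [sigmoid(z) for z in pre_activation]  # Eq. 2.15
        hidden_states.append(h)
        outputs.append(matvec(W_yh, h))           # Eq. 2.16
    return hidden_states, outputs


# Toy dimensions (illustrative): 2-d inputs, 3-d hidden state, 2-d outputs.
W_hx = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
W_hh = [[0.0, 0.4, -0.2], [0.3, 0.0, 0.1], [-0.1, 0.2, 0.0]]
W_yh = [[1.0, -1.0, 0.5], [0.2, 0.3, -0.4]]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden_states, outputs = rnn_forward(inputs, W_hx, W_hh, W_yh, h_init=[0.0, 0.0, 0.0])
```

The same weight matrices are reused at every step, which is what "multiple copies of the same network" means when the RNN is unrolled.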

An RNN model such as this was used by Cho et al. (2014) with a fixed-sized
vector, and Karpathy (2015) has illustrated the capacity of these models
to generate compelling outputs.⁹ One of the limitations of RNNs is capturing
long-term dependencies. This limitation was directly addressed when Hochreiter
and Schmidhuber (1997) introduced the Long Short-Term Memory cell,
or LSTM [123]. The LSTM was designed to avoid the long-term dependency
problem by using a more powerful update equation and improved backpropagation
dynamics. The cells were designed to ‘remember’ information over
many steps.

The key idea behind LSTMs is that each individual LSTM cell has a notion
of state. The state allows information to flow through the cell unchanged;
the cell also has the capacity to remove or add information to the cell state by
using information gates. An LSTM can be compared to the basic RNN cell by
comparing the mathematical equations underlying the information flow of the
input-output mapping functions. Breaking the LSTM down into three components
(or gates) clarifies cell dynamics: 1) the forget gate, $f$, 2) the input
gate, $i$, and 3) the output gate, $o$. $c$ represents the memory cell state. Below,
matrix $W^q$ represents input weights and $U^q$ represents recurrent weights, where
$q \in \{f, i, o, c\}$. Bias vectors, $b_q \in \mathbb{R}^h$, are excluded for clarity.


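$$f_t = \sigma(W^f x_t + U^f h_{t-1}) \tag{2.17}$$
$$i_t = \sigma(W^i x_t + U^i h_{t-1}) \tag{2.18}$$
$$o_t = \sigma(W^o x_t + U^o h_{t-1}) \tag{2.19}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W^c x_t + U^c h_{t-1}) \tag{2.20}$$
$$h_t = o_t \odot \tanh(c_t) \tag{2.21}$$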


Initial values are defined as $c_0 = 0$ and $h_0 = 0$, and the operator $\odot$ denotes
the Hadamard or element-wise product. Also, $x_t \in \mathbb{R}^d$;
$f_t, i_t, o_t, h_t, c_t \in \mathbb{R}^h$; $W \in \mathbb{R}^{h \times d}$ and
$U \in \mathbb{R}^{h \times h}$. Weight matrices $W^q$ and $U^q$, and bias vectors $b_q$,
are the parameters learned during training.
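A single LSTM step following Eq. 2.17 to Eq. 2.21 can be sketched as below, again with biases excluded. The dictionary layout for the weights (keyed by gate name), the helper names, and the toy parameter values are assumptions for illustration; they are not the implementation used in this thesis.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def hadamard(u, v):
    """Element-wise (Hadamard) product of two vectors."""
    return [a * b for a, b in zip(u, v)]


def lstm_step(x_t, h_prev, c_prev, W, U):
    """One LSTM update; W and U map gate names 'f', 'i', 'o', 'c' to matrices."""
    def affine(q):
        return add(matvec(W[q], x_t), matvec(U[q], h_prev))

    f_t = [sigmoid(z) for z in affine("f")]                      # Eq. 2.17
    i_t = [sigmoid(z) for z in affine("i")]                      # Eq. 2.18
    o_t = [sigmoid(z) for z in affine("o")]                      # Eq. 2.19
    candidate = [math.tanh(z) for z in affine("c")]
    c_t = add(hadamard(f_t, c_prev), hadamard(i_t, candidate))   # Eq. 2.20
    h_t = hadamard(o_t, [math.tanh(c) for c in c_t])             # Eq. 2.21
    return h_t, c_t


# Toy parameters (illustrative): d = 2 input dimensions, h = 2 hidden units.
W = {q: [[0.1, -0.2], [0.3, 0.4]] for q in "fioc"}
U = {q: [[0.0, 0.5], [-0.5, 0.0]] for q in "fioc"}
h_t, c_t = [0.0, 0.0], [0.0, 0.0]  # h_0 = 0 and c_0 = 0
for x_t in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h_t, c_t = lstm_step(x_t, h_t, c_t, W, U)
```

Note that the cell state is only rescaled by the forget gate and incremented by the gated candidate, which is the path along which information (and error) can persist over many steps.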

The parameters of an RNN with LSTM units can be trained, on a set 
of training data, using gradient descent with backpropagation. This update 
iteratively modifies the weights of the network in proportion to the derivative 
of the error with respect to the weight. Using gradient descent for standard 
RNNs can result in vanishing or exploding gradients [122]. With LSTM units, 
as errors propagate backwards from the output layer, the error remains in the
LSTM unit’s cell. This maintains the propagation of error back to each of the
LSTM gates [123].

RNNs with LSTM cells, often called LSTMs for short, can be used for 
both the encoder and decoder (Figure 2.2). They may be unidirectional or
bidirectional (scanning over input in one or both directions) [294], and single or
multi-layered (network depth). We now provide details on designing, training,
and using sequence-to-sequence neural network models.


Sequence-to-Sequence Models Neural response generation models use
a neural network to map input (or source) sequences of words,
$S = x_1, x_2, \dots, x_m$, to output (or target) sequences of words,
$T = y_1, y_2, \dots, y_n$.
We will refer to these models as handling sequences of words, but they
can work on other text units (e.g. subwords, characters, word pairs). Also,
the lengths of these sequences ($n$ and $m$) need not be equal. Sequence-to-sequence
models (SEQ2SEQ) are one form of neural response model which has found
success in machine translation, speech recognition, text summarization, and
dialogue generation [15], [52], [302], [313], [318]. SEQ2SEQ models maximize
the probability of generating a sequence of target words given a sequence of
source words. SEQ2SEQ models make it possible to condition the response
on a longer historical context than previous models [280].

These models learn conversational dialogue response generation by predict- 
ing responses and using a maximum-likelihood estimation (MLE) objective 
function [257], [318]. The model is trained to maximize the probability of the 
target sentence given the source. As these probabilities can be small in prac- 
tice, it is more numerically stable to work with the logarithm of the likelihood 
(log-likelihood). 
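The numerical-stability point can be seen in a small sketch. The per-word probabilities below are invented for illustration: multiplying many small probabilities underflows to zero in floating point, while summing their logarithms remains well behaved.

```python
import math

# Hypothetical per-word probabilities for a 400-word target sequence.
word_probabilities = [1e-3] * 400

# Direct product of probabilities: underflows to 0.0 in double precision.
likelihood = 1.0
for p in word_probabilities:
    likelihood *= p

# Sum of log probabilities: a finite, usable training signal.
log_likelihood = sum(math.log(p) for p in word_probabilities)
```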

A SEQ2SEQ model learns parameters, $\theta^*$, which maximize the conditional
log-likelihood of producing the correct output $T$ given input $S$ for all pairs in
a set of training data $D$:

$$\theta^* = \arg\max_{\theta} \frac{1}{|D|} \sum_{(S,T) \in D} \log p_{\theta}(T \mid S) \tag{2.22}$$

As the model is often differentiable, gradient-based training such as
mini-batch stochastic gradient descent can be used to estimate model
parameters [52]. A modified version of backpropagation, backpropagation through
time (BPTT), is used for RNNs as parameters are shared across time steps in
the network. The gradient calculated at each output depends not only on the
current time step, but also previous time steps.

The trained model can be used to assign a perplexity to a given source-target
pair $(S, T)$. It can also be used for generation by predicting the most
likely target, $\hat{T}$, according to the model, formally:

$$\hat{T} = \arg\max_{T} p_{\theta}(T \mid S) \tag{2.23}$$

The most likely targets are found using a left-to-right beam search decoding
scheme. Beam search maintains a number, $B$, of partial target hypotheses [318].
Beam search decoding is an approximate, greedy algorithm which uses
a best-first heuristic search to explore a possibility space by expanding the most
promising hypotheses.¹⁰ At each time step the prefix hypotheses are extended
with each word in the vocabulary. Using the log probability calculated with
the model, all but the $B$ most likely hypotheses are discarded. When the ‘END’
token is appended to a hypothesis, that hypothesis is removed from the beam
search and added to a complete candidate list. This complete hypothesis list
may then be re-scored using an additional scoring function [318].
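The decoding procedure described above can be sketched as follows. The trained model is stood in for by a hypothetical bigram table (`TOY_MODEL`) and a helper `toy_next_log_probs`; both are invented for illustration, as are the function name `beam_search` and the `max_len` cut-off. A real SEQ2SEQ decoder would condition on the source sequence and the full prefix.

```python
import math


def beam_search(next_log_probs, beam_width, max_len, start="START", end="END"):
    """Left-to-right beam search keeping the beam_width best partial hypotheses."""
    beam = [([start], 0.0)]  # (words so far, summed log probability)
    complete = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            # Extend each prefix with every word the model allows next.
            for word, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + [word], score + log_p))
        # Discard all but the beam_width most likely hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for hypothesis in candidates[:beam_width]:
            if hypothesis[0][-1] == end:
                complete.append(hypothesis)  # finished: leaves the beam
            else:
                beam.append(hypothesis)
        if not beam:
            break
    return sorted(complete, key=lambda c: c[1], reverse=True)


# Hypothetical next-word distributions, conditioned only on the previous word.
TOY_MODEL = {
    "START": {"i": 0.6, "you": 0.4},
    "i": {"am": 0.9, "END": 0.1},
    "you": {"are": 0.9, "END": 0.1},
    "am": {"glad": 0.7, "END": 0.3},
    "are": {"glad": 0.7, "END": 0.3},
    "glad": {"END": 1.0},
}


def toy_next_log_probs(prefix):
    return {word: math.log(p) for word, p in TOY_MODEL[prefix[-1]].items()}


ranked = beam_search(toy_next_log_probs, beam_width=2, max_len=5)
best_words, best_score = ranked[0]
```

With a beam width of 2, shorter completions such as "i am END" are pruned before they can be collected, which illustrates why the search is approximate rather than exhaustive.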

This MLE objective may provide likely responses, but it is likely not the
optimal surrogate objective for a dialogue agent [176]. Analogous to the
repetition problem due to loops in n-gram models described above, MLE-based
SEQ2SEQ models may not account for repetition. As well, SEQ2SEQ models
tend toward generic responses (e.g. ‘no’ or ‘i dont know’) given their relatively
high frequency in many source corpora [176], [177], [301].


Encoder - Embedding - Decoder Model The objective of sequence prediction
models is to predict an output sequence given an input sequence.





10 http://www.dtic.mil/dtic/tr/fulltext/u2/a049288.pdf











[Figure 2.2 diagram: blocks labelled ENCODER, MODEL, and DECODER; example sentence “I am so glad you called ,”; CONTEXT: “breaking up”.]


























Figure 2.2: Figure of the encoder / decoder framework. 


SEQ2SEQ models serve as the basis for many current chatbots [299], and 
these models use similar architectures (illustrated in Figure 2.2). At a high 
level, SEQ2SEQ models are composed of an input encoder (which maps words
into fixed-length embeddings), an embedding processing model (which generates
response embeddings from input embeddings), and an output decoder (which maps
embeddings to words). Additional context information might be appended to the inputs.
There are many classes of sequence modelling problems based on the relative
shapes of input and output sequences.¹¹ Generally, there is an input sequence
of multiple tokens mapped to an output sequence of multiple tokens. We may 
want to map a sequence input to a single class, or alternatively, map a single 
token to a sequence of output tokens. These are special cases of the general 
sequence modelling problem. 

The encoder maps a variable-length input sequence (source) to a fixed- 
length vector. The decoder maps a fixed-length vector to a variable-length 
output sequence (target). Often the encoder and decoder are separate functions,
but they may share weights to speed up learning. This technique is
particularly useful when using deep neural networks with many
parameters [52], [168].

Similar to the n-gram model, a vocabulary list defines the words that the 
model can encode and decode. In addition to the special tokens introduced
prior (‘START’, ‘END’, ‘OOV’), neural models may also utilize the ‘PAD’
token. This token allows for batches of source/target pairs to be padded to
constant width. This processing technicality improves performance of models
at test time. Fixed-dimensionality embeddings allow for the use of deep
neural networks (DNNs), which have been shown to be powerful models on
many language tasks [70], [216], [318].
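As a minimal sketch of how the special tokens are used, the function below wraps each sentence in ‘START’ and ‘END’, replaces out-of-vocabulary words with ‘OOV’, and pads every sequence in a batch to the width of the longest. The function name `encode_batch`, the whitespace tokenization, and the toy vocabulary are assumptions for illustration.

```python
PAD, START, END, OOV = "PAD", "START", "END", "OOV"


def encode_batch(sentences, vocabulary):
    """Tokenize sentences and pad them to a constant width with PAD."""
    tokenized = [
        [START] + [w if w in vocabulary else OOV for w in s.split()] + [END]
        for s in sentences
    ]
    width = max(len(tokens) for tokens in tokenized)
    return [tokens + [PAD] * (width - len(tokens)) for tokens in tokenized]


vocabulary = {"i", "am", "glad"}
batch = encode_batch(["i am glad", "hello"], vocabulary)
```

Every row of `batch` now has the same length, so the batch can be stored as a rectangular array.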

LMs are partly inspired by theories of context-free grammars and generative
grammars, introduced by Chomsky (1959), coupling syntactic and semantic
information in sentences [53]. Grammar is a set of rules governing the
composition of words in a language. Each language is based on a grammar. The
order of words influences the meaning of the sentence. Discriminating
the grammar underlying sentences is paramount when parsing sentences
for language understanding [267], [287]. In fact, neural language models have
been used to track abstract hierarchical syntactic structure [102] and learn
context-free grammars [295]. Learning semantic and syntactic information
will improve language modelling. These works support the notion that neural
models can learn to track grammatical structure and meaning, a hypothesis
which motivated early neural network research [77], [283].


2.2.5 Neural Model Improvements 


In the SEQ2SEQ model, the last source state is passed from the encoder to the 
decoder when starting the decoding process. The single fixed-size hidden state 
becomes an information bottleneck for long sentences. Similar in motivation 
to the LSTM, the concept of ‘attention’ was introduced to give the decoder 
more direct access to the information in the input [15], [197]. The key idea
of attention mechanisms is to establish direct short-cut connections between
the target output and the source input by attending to relevant source content
during translation. Instead of discarding all hidden states computed in the
source RNN, attention allows the decoder to peek at them (treating them
as a dynamic memory of the source information). The attention mechanism
improves the translation of longer sequences [15], [197]. Attention mechanisms
have been applied to many sequence modelling tasks including image caption
generation, speech recognition, and text summarization.¹²
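The idea of peeking at all source hidden states can be sketched with a simple dot-product scoring function. This scoring choice is an assumption made for brevity; the cited works describe several scoring functions, and this sketch is not claimed to match any one of them exactly. The names `softmax` and `attend` and the toy vectors are likewise illustrative.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to one."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Weight every encoder hidden state by its similarity to the decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dimension = len(encoder_states[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states))
        for k in range(dimension)
    ]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0]]  # one hidden state per source word
uniform_weights, uniform_context = attend([0.0, 0.0], encoder_states)
focused_weights, focused_context = attend([10.0, 0.0], encoder_states)
```

The context vector is recomputed at every decoding step, so the decoder is no longer limited to a single fixed-size summary of the source.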

The Transformer model is a sequence modelling neural network architecture
based on self-attention [335]. The self-attention mechanism directly models
relationships between all words in a sentence, regardless of their respective
position. The Transformer model assigns every word in the sentence
a relative position encoding. These positions are added to the embedded
representation. The benefits of the Transformer model include increased
quality of sequence generation, better parallelization on modern hardware,
better generalization, and less time to train [334], [335]. The Transformer
model has outperformed recurrent and convolutional networks on a diverse set of
sequence modelling tasks [335]. The open-source Tensor2Tensor Transformer
framework¹³ was used for the experiment in Chapter 5.

Thus, we come full circle: from naive rule-based chatbots, to n-gram and
Hidden Markov Models (HMMs), to finding inspiration in the methods of
statistical machine translation, and advancing neural response generation with
the long short-term memory (LSTM) and Transformers. Additional considerations
for corpus-based chatbots and neural model training are included in
Appendix C. Many neural modelling techniques and experiments would not be
possible without two crucially important additional factors: suitable datasets
and rigorous evaluation.


2.2.6 Datasets 


There is an adage in computing science that garbage-in leads to
garbage-out [14], and so it is with conversational dialogue systems. These systems
continue to rely on large corpora of high-quality data collected, indexed, and
accessible by the model at training and inference time. There are many large
datasets available to train dialogue systems [298]. The survey from Serban et
al. (2015) attempts to collect these datasets and profile their characteristics
to ease their selection and use. Data collection and sharing is an active area of
research in the field, and with the continued deployment of conversational sys- 
tems, these datasets will continue to grow in scale and diversity of language, 
task, and structure. 

13 https://github.com/tensorflow/tensor2tensor

Datasets used in this thesis work include the Cornell Movie-Dialog Corpus
(CMDC) and OpenSubtitles (OS). The CMDC [65] contains 220,579 conversational
exchanges between 10,292 pairs of movie characters, involving 9,035 characters
from 617 movies, for a total of 304,713 utterances. The OS dataset is
considerably larger, at 446,612 movie subtitle files covering more than 110,000
unique movie identifiers (or IMDB IDs)¹⁵ for a total of 441.5 million lines, but it is
noisier data (e.g. duplicates, spelling and grammar mistakes) [185]. It consists
of user-contributed movie and TV subtitles which are then processed with
an automated processing pipeline. OS constitutes a resource for conversational
dialogue. From a linguistic perspective, movie subtitles cover a wide
range of genres, from colloquial language or slang to narrative and expository
discourse as in documentaries [185].


2.2.7 Evaluating Dialogue Systems 


How do we go about evaluating dialogue systems when the task is not related 
to a binary outcome, as is common in goal-oriented or non-conversational 
dialogue? This question is of particular interest for the open-domain dialogue 
systems discussed in the following chapters. 

Chatbots can be separated into open- and closed-domain systems. Open- 
domain dialogue systems (i.e. chit-chat bots, chatter bots) can discuss a di- 
verse set of topics and domains. Closed-domain systems (i.e. domain specific, 
task- or goal-oriented bots) limit conversation to a specific domain or task. 
Evaluation of task-directed agents is considered more straightforward than for 
open-domain bots [270], [338]. 

Interlocutors engaged in chit-chat conversation often do have goals.¹⁶ Cognitive
science and conversational linguistics have explored the many underlying
goals of conversational dialogue [161], such as: generating shared knowledge,
obtaining information, delivering action instruction, understanding motivation,
and/or building relationships. These goals are more abstract than a
closed-domain, outcome-centric, booking task (e.g. “reserve a table for 4 at
6:00pm at the Italian restaurant down the block”). Goals also change over the
course of a dialogue [161].

14 http://opus.nlpl.eu/OpenSubtitles2018.php
15 https://www.imdb.com/
16 Conversational partners.

The goals listed above are a step towards evaluation, but they are still 
underspecified. What we desire is a testable condition for success; a mea- 
sure allowing us to compare one model against another. In the chapters that 
follow, we describe how specific concepts from ‘improvised theatre dialogue’ 
can provide these measures. Before discussing these, it is helpful to provide 
background on current evaluation techniques. 

Early work in dialogue generation used human interactions to test systems
(Appendix B). When automated optimization of these models became necessary,
due in part to the large number of parameters and hyperparameters, the
community looked to statistical machine translation (SMT) for metrics [280].
As Liu et al. (2016) show, these metrics correlate weakly with human
judgements [188]. Additional work has shown weak correlations between humans
and other automatic metrics [193].

Automated metrics from SMT [296], including overlap-based methods such
as Jaccard similarity, BLEU [251], METEOR [17], ROUGE [183], and CIDEr [336],
are used for machine translation, summarization, and image captioning tasks.
These methods assume a valid, aligned output for a given input has significant
word overlap with the ground truth response. For instance, bilingual
evaluation understudy (BLEU) evaluates the quality of text which has been
machine-translated from one natural language to another. While this inspired
model innovation and development for previous research [280], word overlap
need not exist in conversational dialogue [100].

Training data specifies one notion of naturalness, so we can measure how 
well the output data matches the input data. Evaluating an LM is often done 
by measuring the perplexity [304], a measurement of how well a probability 
distribution predicts a given sample. This is a useful extension of the likelihood 
calculations from Section 2.2.3. Recall that an LM represents a probability 
distribution over a given corpus. For a given LM, low perplexity indicates a
high likelihood sample; conversely, high perplexity indicates a low likelihood sample.

Entropy is a measure of the expected number of bits required to encode
the outcome of a random variable. Assume we have a sequence of words,
$w_1, w_2, \dots, w_m$, from a vocabulary $V$. The probability of a given word, $w_i$,
is dependent on the previous words, $w_1, \dots, w_{i-1}$. The per-word entropy, $H$,
represents the new information (in bits) provided by each word on average:

$$H = -\lim_{m \to \infty} \frac{1}{m} \sum_{w_1, \dots, w_m} p(w_1, \dots, w_m) \log_2 p(w_1, \dots, w_m) \tag{2.24}$$

where the sum is over all word sequence permutations. If we assume the
LM to be ergodic and stationary, then for large values of $m$, the approximate
entropy is [83]:

$$\hat{H} = -\frac{1}{m} \log_2 p(w_1, \dots, w_m) \tag{2.25}$$


An ideal LM would have a per-word entropy of $H$. We use a measure
related to entropy to compare language models. The perplexity of a probability
distribution, $p$, is related to the entropy, $H$, of a distribution (in bits):

\begin{align}
\text{perplexity} &= 2^{\hat{H}} \tag{2.26} \\
&= p(w_1, \dots, w_m)^{-\frac{1}{m}} && \text{by Eq. 2.25} \tag{2.27} \\
&= \sqrt[m]{\frac{1}{p(w_1, \dots, w_m)}} \tag{2.28} \\
&= \sqrt[m]{\prod_{i=1}^{m} \frac{1}{p(w_i \mid w_1, \dots, w_{i-1})}} && \text{by chain rule} \tag{2.29}
\end{align}

where $p(w_1, \dots, w_m)$ is the probability estimate of the sequence of words,
$w_1, \dots, w_m$, estimated by the LM.
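As a worked sketch of the relationship between the entropy estimate (Eq. 2.25) and perplexity (Eq. 2.26), the function below takes the conditional probability an LM assigns to each word of a sequence. The function name `perplexity` and the probability values are assumptions for illustration.

```python
import math


def perplexity(conditional_probabilities):
    """Perplexity of a sequence from its per-word conditional probabilities."""
    m = len(conditional_probabilities)
    # By the chain rule, log2 p(w_1..w_m) is the sum of per-word log2 terms.
    log2_sequence_probability = sum(math.log2(p) for p in conditional_probabilities)
    entropy_estimate = -log2_sequence_probability / m  # Eq. 2.25, in bits
    return 2.0 ** entropy_estimate                     # Eq. 2.26


# A model that spreads probability evenly over 4 words at every step
# has an entropy estimate of 2 bits and a perplexity of 4.
uniform_perplexity = perplexity([0.25, 0.25, 0.25, 0.25])
mixed_perplexity = perplexity([0.5, 0.125])
```

Perplexity can thus be read as the effective number of equally likely words the model is choosing between at each step.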

While perplexity provides a measure of the likelihood of a response based 
on a training corpus, and a means to compare trained language models, there 
are limitations. Plausible responses with out-of-vocabulary words might be 
assigned a low likelihood. Also, while perplexity is meaningful and available
for probabilistic LMs (i.e. generative models capable of assigning a probability
to a candidate response), this metric is not applicable to some models. Thus,
it is difficult to use this metric for model comparison. Additionally, a naive
echoing model (i.e. one which repeats inputs) would generate ‘good’ responses
under this notion [82]. These outputs are well within the input distribution,
but the conversation lacks a quality familiar to anyone who has conversed. 
The responses do not add additional information to the conversation. We need 
to account for the fact that good conversation incrementally, collaboratively 
builds information at each conversational step. 

Human raters (e.g. crowd-sourced workers) have generated ‘good’ responses and/or
provided scores for the model-generated responses. For instance, raters may be
shown context and then pick their preferences between candidate outputs [188]. 
This technique is costly in both time and money. Also, human raters can be 
subjectively biased (e.g. toward short responses) and can have high variance 
between raters based on personal differences (e.g. reading ability). 

Additionally, it is difficult to ask the right questions to get meaningful, re- 
producible measures from crowd raters [73]. Several studies have asked ques- 
tions related to overall appropriateness of response, whether the response is on- 
topic, whether the response is specific, and whether the response requires addi- 
tional background information [178], [272], [300], [338]. Researchers have also 
asked raters if a given conversation would be one they want to continue [154]. 

With enough humans agreeing on ratings for an input/output pair, it is 
conceivable that a model could be built to predict human ratings on unseen
inputs [73], [193]. These works allow for the prediction of human ratings, 
which can allow for automated testing and comparison of model improvements 
without costly human-participant studies. 

It is paramount to test models with real-life human interaction when developing
models designed for human interaction. Competition-based evaluations (discussed
in Appendix B) include ‘wild’ evaluations which involve human conversations
with bots. Following the conversation, the human converser scores
the interaction based on various measures which, depending on the competition,
might include measures of ‘adequacy’, ‘informativeness’, ‘fluency’,
‘consistency’, and ‘engagingness’, or more abstract measures of ‘how much a
person liked the conversation’ or ‘how well the bot stuck with a consistent
persona’ [127], [250], [364]. Additional evaluation methods are presented and
discussed in Appendix C.1.

Systems can model the personality [364], mood [93], and emotion [47] of 
generated sentences as well as human affect in interactive settings [362]. By 
modelling the psychological state of the user, and adapting accordingly (much
like human interlocutors do), conversational systems will continue to evolve and
personalize [50]. Ongoing appraisal and adaptation require significant language
and non-verbal interaction understanding. The transition from building
systems which reflect the underlying training dataset, to systems which adapt
pre-trained models to ongoing interaction, is an ongoing area of research and is
discussed in detail in Chapter 7.

It is important to evaluate from both human-centric and machine-centric 
perspectives. The evaluation of dialogue systems continues to be explored 
as an active area of research [188]. Dialogue systems should be measured 
and evaluated based on the interaction that they facilitate. In the work that
follows in future chapters, we lean towards evaluating a dialogue system by
evaluating the application in which it is used. By measuring qualities of
the application of a model downstream, we gain valuable insights to compare
models (Chapter 3).


2.2.8 Conversational Dialogue Summary 


This section provides a review of the history, implementation, and evaluation
of conversational dialogue systems. This chapter has covered interactive
machine learning (Section 2.1) and dialogue systems (Section 2.2), but they
have remained loosely coupled. There is a missing connection in the loop.
That missing piece is a real-time interactive test bed for conversational
dialogue systems. The next section provides background and motivation for an
ideal testing ground for IML-based conversation systems: the applied Turing
test of live improvisational theatre.


2.3 Improvisational Theatre


Improvisational theatre (or improv) is a form of theatre in which some or all 
components of the performance are unplanned or unscripted. Improvisation 
is performing without any preparation or planning [109]. Improv is theatre 
conceived, written, directed, and performed in the same moment, often in front
of a live audience. An abridged history of improvising intelligences is included 
in Appendix D. 

The ‘offer’ is a common term across improvisational theory [147]. It defines 
the basic unit of collaboration between individuals. Offers can be physical, 
mental, or verbal. For the majority of the work in this thesis, I focus on verbal
offers with physical embodiments (e.g. robots and projections); extensions to
additional means of delivering offers are discussed in Chapter 7. All offers can
be characterized by how much they accept and how much they extend [147].
These properties are sometimes summarized with the concept of “Yes, and...”, 
common in improvisation. 

When given an offer in an improvisational scene, an improvisor will re- 
spond with an offer of their own. The responding offer can either ‘accept’ or 
‘reject’ the initial offer. Additionally, the response can either ‘extend’ (build 
on) or ‘advance’ (move past) the initial offer. Improv scenes are composed of
sequences of offers; each verbal offer is a sequence of words with additional
features of character: intonation, speed, and directness. These sequences of
words, and sequences of offers, are what interactive artificial improvisors
model and generate using the techniques from Section 2.2. Concepts of accepting
and extending are further contextualized in the framework of generative
language models in Chapter 5, where they are used to model underlying narrative
arcs.


2.3.1 Theatre Technology and Automata 


The theatrical stage is a reflection of our natural world. Much like the rest 
of our world, theatre is intrinsically enabled by technology [115]. Technologies
with which we augment our lives, including screens, computers, lights, and
videos, have thus found their way into modern theatrical productions. Technical
theatre is a field which includes stage and set design, lighting controls
and cues, sound design, and visual projection design. In live performance,
stagecraft is as old as theatre itself [108], [316]. To Shakespeare, the theatre was
a space-time machine driven by the energy of imagination [115], [164]. Not
only is the incorporation of technical theatre into productions functional, it is
as artistic as the performance itself.

As natural as it is for humans and technology to share the stage, it is important
to inspect digital augmentation from two perspectives: 1) how can
humans best be served by the technology of the production? And, 2) how
can the technology inspire as a source of creativity, or co-creator, on stage?
Answers to these broad questions, with regard to the performative consequences of
incorporating technology in live improvisation, are explored in this section. As
above, it is important to appreciate history to gain perspective of the future. 

The word automaton comes from Greek origins and means roughly “acting 
of one’s own will”. It was used by Homer to describe an automatic door 
opener [333], and has now come to refer to non-electronic moving machines. 
Automata are found throughout history. Often described as statue-like, but 
with an element of motion, these automata serve as inspiration for the robots 
and cyborgs of current culture. These automata stretch across cultures from
Greek mythology, to Jewish legend,¹⁷ to ancient Chinese texts [244]. There is
an almost universal appeal to these figures at the intersection between human 
and machine, a thread through history that connects us all. 

Leonardo da Vinci, the great artist-scientist, made several automatic ma- 
chines. He built a lion which automatically lurched several steps forward and 
opened its breast to reveal a collection of lilies. He also built a self-propelled 
cart (not unlike the self-driving cars of today), and a mechanical knight.¹⁸ Da
Vinci is a notable connection in this context as these automata are not the
only automatic theatre technology he designed and built. His journals are
full of engineering notes on stagecraft and preparations for large theatrical
displays for parties [135].

https://goo.gl/F1K6Fd


2.3.2 Cyborg Theatre: Stimulation Over Simulation 


Cyborg theatre places robots, automata, and learning machines on the stage 
during a production [252]. While there are many delightful inspirations which 
stem from the world of cyborg theatre, there is much to suggest that robots on 
stage, like children or animals, attract all the attention of audiences and might 
steal the show from human actors when performing on the same stage [192]. 
Placing a cyborg on the stage confronts an audience with several challenging 
themes, most notably the benefits and consequences of creation. The creation 
theme progresses toward two paths: 1) the replacement of ourselves and each 
other, and 2) creating something that cannot be contained.

Thus, cyborg theatre finds itself in a cultural context envisioning human-machine
encounters as problematic and dangerous [252], likely due to fear:
the fear of not being good enough, not being apt for the future, and the
fear of being replaced by someone or something superior [99]. The creation
of a human from clay is a theme that occurs across religions and cultures
(e.g. Prometheus [111] and the golem from Jewish folklore [133]). The
descriptions of the golem could just as easily be referring to a computer: ‘if
commanded to perform a task, they will follow instructions literally’. As Ada
Lovelace described in 1842, “the Analytical Engine has no pretensions whatever
to originate anything. It can do whatever we know how to order it to
perform” [32]. The notion of obedient machines, and the hubris which
accompanies them, also appear in Mary Shelley’s “Frankenstein; or, The Modern
Prometheus” [307], James Cameron’s “The Terminator” [38], and in Karel
Čapek’s “R.U.R. (Rossum’s Universal Robots)” [40]. Notably, R.U.R. is the
1921 play which coined the term robot.

There is a tension between provoking and pleasing the creator, and while 
cyborg theatre is filled with fear, it is also filled with a unique sense of love 
for metamorphosis. Such is the tension underlying the ancient Greek myth of 
Pygmalion, who fell in love with one of his sculptures [234]. Pygmalion, and 
the George Bernard Shaw play of the same name [306], are the inspiration 
for our first attempt at creating an artificial improvisor, Pyggy, described in 
Chapter 3. This adoration of transformation is hopeful. It is hopeful that we 
humans can effect change and, in doing so, may be changed ourselves. Modern 
stories explore the themes of bringing a creation to life; bringing one of 
these systems onto the stage is another question. 


2.3.3 Live Theatre with Chatbots 


What does live theatre mean? Does it mean that the performers on stage 
are alive, that the audience is alive, or both? How much liveness must there 
be for it to be live theatre? A ventriloquist act would be considered live 
theatre performed by a human and a puppet. But what if the puppet acted 
independently, with no ventriloquist? What if there is a voice that can interact 
with the performers on stage? What if there is nothing but a voice on stage 
that can interact with an audience? Imagine a stage full of acting bots and 
an audience full of reacting bots. With no living beings present, would this be 
considered live theatre? 

It is the extremes of this line of thought that permeated Auslander’s mind 
when he stated “the existence of chatterbots reopens and re-frames the 
question of liveness at a fundamental level” [13]. A dialogue system capable 
of conversation has a unique liveness to it that eludes definition. In a 
way, it is creating something from nothing, similar to teaching a machine to 
do something it previously could not. These thoughts serve as the undertone 
for cyborg theatre with conversational dialogue generation. 

As described in the history of chatbots in Appendix B, ‘ELIZA’ is often 
credited as the first chatbot. It was the creator, Weizenbaum, who described 
‘ELIZA’ as a machine that simulates humanity by encouraging the dialogue 
partner to improvise: “in a sense, ELIZA was an [actress] who ordered a set of 
techniques but who had nothing of her own to say. The script, in turn, was a 
set of rules which permitted the actor to improvise” [345]. The astute reader 
will notice the name ‘ELIZA’ was chosen because it was also the name of the 
lead character in G.B. Shaw’s “Pygmalion” [306]. As Weizenbaum cleverly said, 
“the program could be taught to ‘speak’ increasingly well, although, also like 
Miss Doolittle, it was never quite clear whether or not it became smarter” [110]. 

An important question to ask while on this line of thought is ‘why?’ Why 
would we bring a machine onto the stage? Why would we want a cyborg 
co-creator in creative endeavours? This section aims to address the ‘why?’ It 
is human nature to create in our own image. Doing so challenges our thoughts 
and feelings about what is real. Most importantly, the machine’s “creativity” 
is seen in the kind of suggestions the computer makes to humans; the system 
serves the creator, and it is both pleasing and provoking [151]. As described 
in Section 2.1, the ultimate goal is to bring together human and machine 
contributions synergistically. The performance on the objective function will 
be better when working together than either contributor could reach alone. 
Creativity emerges from the process of working interactively [182]. 

Theatre represents an ideal experimental setting for robotics and machine 
intelligence research [33], [157], [195]. The theatrical stage is a middle ground 
between controlled environments and complex novel environments. Routines 
and rehearsal are natural in the theatre, but no performance is strictly the 
same twice. Improvisors are masters of co-creation, specialists in processing 
misunderstandings, re-contextualizing to find middle ground, covering com- 
munication breaks, and making ideas connect [147], [192]. Magerko et al. 
(2009) formulated the theoretical processes needed for a computational 
system for improvisation [200]. That study reports several important aspects 
of cognition in improvisation (basic cognition, shared mental models, 
narrative development, and referent use) and frames improvisational theatre 
as real-time dynamic problem solving [146]. 


2.3.4 Interactive Machines in Theatre and Comedy 


There is a growing intersection of researchers and artists exploring interac- 
tive machines in theatrical productions. Interactive robot theatre has been 
explored as a test-bed for sociable robotics [33]. Theatre robots are unique 
in their performances and often tailored to the audiences and productions in 
which they appear. That said, there is much in common between the systems. 
They are often built of three major components: 1) main control software, 
2) a perception stream (e.g. visual, auditory, or tactile), and 3) an expressive 
embodiment. With these three components, we can describe many different 
systems which have been used in performances and research papers of the last 
few decades. 

Robotic performances have been explored previously [33] and are discussed 
in this section. In 2000, Tom Sgouros performed “Judy, or What is it Like 
to Be A Robot?”.19 

The “silicon-based comedy” robot entertainer work of Dr. Heather Knight 
has been featured in many high-profile media outlets [158]. Knight used 
a visual perception stream to track audience responses, and a joke control 
system selected each next joke to be similar to those that the audience had 
responded well to. The expressive embodiment was the Aldebaran (SoftBank) 
Nao robot.21 

Katevas et al. (2014) programmed and scripted a life-sized humanoid robot 
(the RoboThespian™ from Engineered Arts22) as the embodiment to perform 
a stand-up routine [153]. During the performance, gesture and gaze were 
modulated and the effects on the audience were measured. Notably, they found 
that simulated eye contact between human and robot increases positive 
human audience response. They also present findings that people responded 
more positively to human comedians than to the RoboThespian™. 

RoboThespian™ also made an appearance in Jon Welch’s “Spillikin, a love 
story” (2015) by the Pipeline Theatre. All activities of the robot were 
pre-programmed and controlled by software running backstage. Timing was 
flexible, based on cues from a controller. 

Hiroshi Ishiguro’s lifelike android embodiments were featured in two theatre 
pieces: “I am a worker” (2008) and “Sayonara” (2010) [263]. Both were 
collaborations between Ishiguro and director Oriza Hirata, and they aimed for 
the Japanese concept of “sonzai-kan”, or the feeling of the presence of a 
person. Ishiguro excels at lifelike embodiments, and the control of the robots 
was pre-programmed, rehearsed, and timed. It is unlikely these androids had 
any perceptual stream. 

19 http://sgouros.com/judy/ 
20 https://www.ted.com/talks/heather_knight_silicon_based_comedy 

Duncan et al. (2010) performed Shakespeare’s “A Midsummer Night’s Dream” 
with several of the fairies played by robots and surveyed the audience about 
believability [71]. In 2014, Carnegie Mellon University’s Personal Robotics 
Lab and School of Drama collaborated to stage a production of David Ives’ 
“Sure Thing” with the robot HERB [361]. The robot was controlled live on-stage 
by a human in plain sight of the audience. More interactively, the robot Myon 
made an appearance in “My Square Lady”, a 2015 opera at the Komische 
Oper Berlin.23 Myon was able to improvise by responding to the live situation 
with small gestures [171]. 

These systems serve as a backdrop for the chapters to follow, and each of 
them provides valuable lessons about bringing a robot to the stage. Through 
engineering efforts, robotic platforms are becoming less expensive and more 
robust, responsive, and powerful. These qualities are inspiring a generation of 
creative artists and scientists to build novel interactions. 


2.3.5 Creative Interactive Machine Learning 


Creative interactive machine learning (IML) is an applied subfield focused on 
creative artistic expression. This growing field has seen artists from 
diverse disciplines adopt and adapt cutting-edge ML science to extend human 
creativity. There have been incredible leaps made in machine-generated 
visual art. Notably, the “Portrait of Edmond de Belamy” (2018), created by a 
GAN (Generative Adversarial Network), was sold for $432,500 at a Christie’s 
auction in October 2018.24 While visual art can dominate coverage in creative 
IML, there exists a growing number of creative technologists integrating these 
systems into their non-visual contemporary work. 

These include designing real-time expressive interactions in a variety of 
media including: story, joke and poem co-creations25; musical interaction26 
and production27; interactive kinetic light projections [191]; interactive 
storytellers [275]; dynamic video game storyline generation [327]; expressive 
movements in stage robots [33]; and playful swarm robots [68]. There is also 
work on tuning trained models using interactive RL for musical melody and 
facial recognition to enhance interactive creative doodle generation [137], 
[138]. There is exciting work which introduces tracking of emotional and 
narrative arcs in text, audio, and video [55]. This tracking can be used to 
shape sequence generation as we present in Chapter 5. 

23 https://www.youtube.com/watch?v=jWBN9627mSg 
24 https://goo.gl/QwéDy9 

Creativity is a subjective quality assigned by the observing audience based 
on societal values [149]. As Margaret Wolfe Hungerford said in 1878, ‘beauty 
is in the eye of the beholder’. Creative sparks from these IML systems ignite 
when curated by humans. Artists continue to embrace these technologies to 
further develop and present explorations of the human condition. When asked 
what the end goal is for work in artificial improvisation, we joked: “when 
two robots are on stage, performing theatre for an audience of robots, we’ll 
know all this work was done in vain.”28 These systems do not aim to remove 
the human. They exist to augment human capabilities and creativity, and 
to inspire humans to create novel interactions. 


2.4 Interaction, Interlocution, Improvisation 


This background section provides relevant supporting material and information 
in the three areas at which this thesis intersects. It covers an overview of the 
fields of IML, conversational dialogue systems, and improvisational theatre. 
It details how creative, interactive machine learning will continue to develop. 
It posits that the setting of improvisational theatre is the ideal test bed for 
experimentation with conversational IML systems. 

Through the following chapters, we formalize the challenge of improvising 
alongside artificial intelligences, explore the imitation game using machine 
learning, improve collaborative narrative generation using 
information-theoretic methods, and present a method to synthesize knowledge 
to generate coherent directed plots. In these chapters, we evaluate how 
collaborative systems are perceived by performers and observing audiences. 
We present challenges and improvements along the iterative process. Finally, 
we synthesize and analyse critical reception and present future directions to 
use humour-in-the-loop to improve human-machine interaction for improvised 
theatre with interactive machine learning systems. 

27 https://www.flow-machines.com/ 

Chapter 3 


Improvised Theatre Alongside 
Artificial Intelligences 


I conceived of an instrument that would create sound without using any 
mechanical energy, like the conductor of an orchestra. The orchestra plays 
mechanically, using mechanical energy; the conductor just moves his hands, 
and his movements have an effect on the music artistry. 

– Léon Theremin, 1989 


In this chapter, we present the formalization of artificial improvisation. 
This chapter builds on the background material in the preceding chapter by 
addressing the grand challenge of human-machine conversational dialogue in- 
teraction in improv. Human-machine improvised theatre is a suitable and 
challenging testing environment and application for interactive conversational 
dialogue experimentation. As a preliminary evaluation of this application, this 
chapter details two intelligent machines trained to perform live shows along- 
side human actors for human audiences. First presented is Pyggy, a rule-based 
system which uses fuzzy string matching and information retrieval methods 
optimized for speed and efficiency. Second is A.L.Ex. (the Artificial Language 
Experiment), a corpus-based neural response generation system using 
recurrent neural networks and trained by maximum likelihood estimation. Once 
these systems are trained, they are evaluated through performance of theatrical 
improvisation alongside human performers. This chapter presents details on 
the methods underlying both systems. The evaluative results, discussions, 
and conclusions support the innovations in the following chapters. 

This chapter is adapted from Mathewson KW and Mirowski P. (2017) Improvised 
Theatre Alongside Artificial Intelligences. 13th AAAI Conference on Artificial Intelligence 
and Interactive Digital Entertainment (AIIDE’17). Snowbird, USA. October 2017. 

This chapter presents the iterative development of artificial improvisation, 
or improvisational theatre performed live, on-stage, alongside an artificial 
intelligence-based improvisational performer. As introduced in the previous 
chapter, the artificial improvisor is a form of artificial conversational agent, 
or chatbot, focused on open domain dialogue and collaborative narrative gen- 
eration. Using machine learning techniques spanning from natural language 
processing and speech recognition to reinforcement and deep learning, chatbots 
have become more capable of common conversational tasks. Work in 
conversational agents has focused on goal-directed dialogue in closed 
domains such as appointment setting, bank information requests, 
question-answering, and movie discussion [10]. Natural human conversations 
are seldom limited in scope, jump from topic to topic, and are often laced 
with metaphor and subtext [145]. Face-to-face communication is supplemented 
with non-verbal cues [214]. Live improvised performance can be framed as a 
natural conversation task where multiple performers collaborate to generate 
narrative in real time in front of an audience. In improvisation, the topic of 
the conversation is often given by the audience several times during the 
performance [147]. These suggestions inspire actors to perform novel, unique, 
and engaging scenes. During each scene, actors must make rapid-fire decisions 
to collaboratively generate coherent narratives. We introduce Pyggy and 
A.L.Ex. (Artificial Language Experiment), the first two artificial improvisors, 
each with a unique composition and embodiment. This chapter highlights 
research and development, successes and failures along the way, celebrates 
collaborations enabling progress, and presents discussions for future work in 
the space of artificial improvisation. 

3.1 Introduction 


Improvisational theatre, or improv, is the spontaneous creation of unplanned 
theatrics, often performed live on-stage in front of an audience. Improv is a 
form of collaborative interactive storytelling, where performers work together 
to generate novel narratives. It is grounded in the connections between the 
performer(s) and the audience. Improv requires the performers to work as a 
team. The actors must rapidly adapt, empathize, and connect with each other 
to achieve natural, fluid collaboration. To truly excel at the art-form, 
performers must think and react to audience reactions quickly, and work 
together to accept and amplify each other’s offers, an act that can be seen as 
real-time dynamic problem solving [200]. Improv demands that human performers 
handle novel subject matter through multiple perspectives, ensuring the 
audience is engaged while progressing narrative and story. Because of this 
difficulty, improvisors must embrace failure and surrender to spontaneity [147]. 

Improvised theatre has been a platform for digital storytelling and video 
game research for more than 20 years [112], [256]. Past research has explored 
several knowledge-based methods for collaborative storytelling and digital 
improvisation [199], [246], [310], [363]. Similar work explores how humans 
interact with systems which improvise music and dance [125], [326]. 
Computer-aided interactive storytelling has been considered for applications 
in video games, with an aim to create endless narrative possibilities in video 
game universes for user engagement [275]. 

Scripted robotic performances, in comedy and drama, have been explored 
previously and are described in detail in Section 2.3.4. In previous perfor- 
mances, robots were precisely choreographed, deterministic, or piloted on 
stage. These shows required the audience to suspend disbelief and embrace 
the mirage of autonomy. These performances verge ever closer to the deep 
cliffs surrounding the uncanny valley—the idea that as the appearance of a 
human-like robot approaches, but fails to attain, human likeness, a person’s 
response would abruptly shift from empathy to revulsion [236]. 


This chapter presents a bridge between the artificial intelligence labs and 
improvisational theatre stages. The bridge crosses over the uncanny valley, 
toward a future where humans and autonomous agents converse naturally 
together. Our work is partially inspired by the narratives behind George Bernard 
Shaw’s “Pygmalion” [306], Mary Shelley’s “Frankenstein” [308], and Alan Jay 
Lerner’s “My Fair Lady” [62]. In these stories, creators attempt to design 
and build reflections of themselves, fabricating their respective ideal images of 
perfection. These themes are discussed in detail in Section 2.3.2. 

We present methods underlying the systems which power the first two 
artificial improvisors. We concisely report on findings, and discuss future work 
at the intersection of artificial intelligence and improvisational theatre. 


3.2 Methods 


This section describes the methods underlying Version 1: Pyggy, using clas- 
sic machine learning and deterministic rules, and Version 2: A.L.Ex. which 
uses deep neural networks, advanced natural language processing, and a much 
larger training dataset. While improvisational theatre is a complex art-form 
mixing dialogue, movement, and stagecraft, and there exist many improvisa- 
tional rules for the novice improvisor [147], in this chapter we focus on a single 
component: training the dialogue system. An artificial improvisor dialogue 
system is composed of three major building blocks (Figure 2.1): 1) speech 
recognition and natural language understanding (NLU), 2) speech synthesis 
and natural language generation (NLG), and 3) a dialogue management 
system with access to external knowledge bases. The three modules comprise 
a simplified framework for extemporaneous dialogue systems, inspired by the 
General Architecture of Spoken Dialogue Systems [258]. We detail these 
components for both Pyggy and A.L.Ex. 
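
The three building blocks can be pictured as a simple pipeline. The following 
is a minimal sketch in Python and not the code behind Pyggy or A.L.Ex.; the 
class name ImprovDialogueSystem, its fields, and the toy stand-in functions 
are all hypothetical, and real speech recognition and speech synthesis are 
replaced here by plain string functions. 

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ImprovDialogueSystem:
    """Hypothetical wiring of the three building blocks described above."""
    understand: Callable[[str], str]         # 1) speech recognition + NLU (stand-in)
    manage: Callable[[str, List[str]], str]  # 3) dialogue management
    express: Callable[[str], str]            # 2) NLG + speech synthesis (stand-in)
    history: List[str] = field(default_factory=list)

    def respond(self, heard: str) -> str:
        """Turn one human offer into one system line and record both."""
        offer = self.understand(heard)
        reply = self.manage(offer, self.history)
        self.history.extend([offer, reply])
        return self.express(reply)


def toy_understand(heard: str) -> str:
    # Stand-in for speech-to-text: normalise whitespace and case.
    return " ".join(heard.lower().split())


def toy_manage(offer: str, history: List[str]) -> str:
    # Stand-in dialogue manager: accept the offer and build on it.
    return f"yes, and {offer}"


def toy_express(reply: str) -> str:
    # Stand-in for text-to-speech: return the line that would be spoken.
    return reply.capitalize()


system = ImprovDialogueSystem(toy_understand, toy_manage, toy_express)
line = system.respond("  We are  on the MOON ")
```

Keeping the three stages behind plain function interfaces means a rule-based 
manager and a neural one can be swapped without touching the speech stages, 
which is the kind of change made between Version 1 and Version 2. 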


3.2.1 Version 1: Pyggy — Rule Based Generation 


Pyggy, short for Pygmalion [306], is the first version of a chatbot-based 
artificial improvisor.1 Pyggy was built using speech recognition powered by 
Google Cloud Speech.2 Speech recognition translates sound waves from the 
human voice to text through a network-dependent application programming 
interface. Pyggy used Apple Speech Synthesis to translate output text to 
sound. This allowed for multiple synthesized voices to be used. Dialogue 
management was handled with the Pandorabots3 and Chatterbot open-source 
software packages.4 

Figure 3.1: High level system diagram of Pyggy. 

1 http://korymathewson.com/building-an-artificial-improvisor/ 

2 https://cloud.google.com/speech/ 
3 http://www.pandorabots.com/ 
4 https://github.com/gunthercox/ChatterBot 

For each human improvisor’s verbal offer, each of these systems would 
generate a response; the responses were then concatenated and output to the 
user. Pandorabots handled hard-coded rules and deterministic responses. For 
example, when the human offered: “Let’s start improvising”, the system would 
always respond: “Ok”. Pandorabots also handled saving named entities. For 
example, if the human said: “My name is Lana” then the system could 
answer the recall question “What is my name?” with the correct answer: “Your 
name is Lana.” Chatterbot was introduced to handle open dialogue 
generation and add randomness to the system. Chatterbot was pre-trained on a 
set of dialogue, as described below, and then “learned” based on responses the 
human gave back to the system (similar to Cleverbot, described in Appendix 
B.7). It works by fuzzy matching the input string with strings in its dictionary 
and then producing the closest match to that input. Fuzzy matching uses the 
Levenshtein distance (or edit distance) as a metric to measure the distance 
between two sequences.5 The Levenshtein distance is the minimum number 
of single-element edits (i.e. insert, delete, substitute) required to change one 
sequence a into another b. Formally, $\mathrm{lev}_{a,b}(n, m)$ is the distance 
between the first n elements of a and the first m elements of b: 


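$$
\mathrm{lev}_{a,b}(n, m) =
\begin{cases}
\max(n, m) & \text{if } \min(n, m) = 0, \\
\min
\begin{cases}
\mathrm{lev}_{a,b}(n - 1, m) + 1 \\
\mathrm{lev}_{a,b}(n, m - 1) + 1 \\
\mathrm{lev}_{a,b}(n - 1, m - 1) + \mathbb{1}_{(a_n \neq b_m)}
\end{cases}
& \text{otherwise,}
\end{cases}
\tag{3.1}
$$

where $\mathbb{1}_{(a_n \neq b_m)}$ is an indicator function equal to 0 when 
$a_n = b_m$ and equal to 1 otherwise. 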
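
The recurrence in Equation 3.1 can be evaluated row by row instead of 
recursively. The sketch below is a plain-Python illustration of that idea and 
of closest-match retrieval; it is not the implementation used by Chatterbot or 
fuzzywuzzy, and the function names levenshtein and closest_match, as well as 
the lower-casing step, are assumptions made for this example. 

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b, following Equation 3.1."""
    # previous[m] holds lev(n - 1, m); the first row is lev(0, m) = m.
    previous = list(range(len(b) + 1))
    for n, item_a in enumerate(a, start=1):
        current = [n]  # lev(n, 0) = n
        for m, item_b in enumerate(b, start=1):
            current.append(min(
                previous[m] + 1,                       # delete from a
                current[m - 1] + 1,                    # insert into a
                previous[m - 1] + (item_a != item_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def closest_match(query, candidates):
    """Return the stored string with the smallest edit distance to the query."""
    # Lower-casing is an assumption of this sketch, not a documented step.
    return min(candidates, key=lambda c: levenshtein(query.lower(), c.lower()))


stored = ["Let's start improvising", "What is my name?"]
best = closest_match("lets start improvising", stored)
```

Only two rows of the table are kept at a time, so memory grows with the 
length of b while time grows with the product of the two lengths. 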

Pre-training of Pyggy was done through an interactive website where in- 
dividuals could directly interact in basic chit-chat dialogue. Unfortunately, 
when the general public had the ability to interact with Pyggy many of them 
started to act adversarially and mischievously, training the system to say rude 
and inappropriate things. Once the compiled training set was cleaned and 
filtered, it was small (fewer than 100 lines). Thus, additional clean training 
data was appended from the Cornell Movie Dialogue Corpus [65]. The dataset is 
composed of 220,579 conversational exchanges from 617 movies and provided 
the system a larger corpus of novel, interesting, and appropriate dialogue to 
pull from. 

Pyggy is embodied by a visualization as seen in Figure 3.1 and Figure 3.5. 
The dynamic image-based visualization of Pyggy was accomplished with Magic 
Music Visualizer.6 The simple animation system controlled the movement 
of the mouth. Mouth movement was modulated by the amplitude of the 
synthesized sound during speech generation. This physical embodiment 
gave Pyggy an animated face on stage. 





5 https://github.com/seatgeek/fuzzywuzzy 
6 https://magicmusicvisuals.com/ 

Figure 3.2: High level system diagram of A.L.Ex. 


3.2.2 Version 2: A.L.Ex. — Corpus Based Generation 


There were limitations to the dialogue which Pyggy could produce, as it was 
restricted to the set of sentences present in the training and interaction data. 
The system was crude in this sense, recalling the most likely response to any 
input from the human. As well, Pyggy had no means by which to understand 
or track the topic of a scene. These limitations prompted us to explore a less 
constrained word-by-word generation approach. 


Automatic Language Generation in Improvised Theatre 


The very nature of improvised theatre relies on spontaneous generative 
conversational abilities. Improvised theatre training relies on teaching the 
actors games which force them to perform fast-paced word associations (e.g., 
“electric ... car ... company”) or sentence completion [147] without 
over-thinking any of their decisions. During these word generation games, 
spontaneity is encouraged and failure (e.g., a non-grammatical choice of word, 
an onomatopoeia instead of a word, or simply a made-up, garbled word 
suggestion) is tolerated and celebrated. By celebrating failure, improvisors 
actively reinforce spontaneity and liberate the creative process [314]. Some of 
the games directly draw on the Surrealists’ Cadavres Exquis idea of taking 
turns in collaborative art generation and require the players to build coherent 
narratives (further details in Appendix D). 

Figure 3.3: Context and input to Artificial Language Experiment and unrolled 
single output decoding. 

Even challenging improvisational exercises in musical improv rely on the 
faculty of spontaneous text generation. In this form, performers follow the 
rhythm and tune of an accompanist while improvising a rhyming song. Many 
musical improv teachers and freestyle rap artists recommend not preparing 
rhymes in advance.7 Rather, they encourage starting lines without 
predetermined ideas of what rhyme can be found, and letting the rhymes arise 
organically in the mind of the improvisor. 

While the word generation process is meant to be spontaneous, it is not 
intrinsically random. Improvisors use their cultural background, their literary 
and pop-culture knowledge, eloquence skills, and vernacular to generate 
sequences of words which seem most obvious to them. Each line is statistically 
likely to occur given the context of the improvisation. 


Text Generation with Neural Language Models 


We imitate the creative process of improvisation using a statistical language 
model that can generate text as a sequence of words. While building an 
open-domain conversational agent able to converse at a human level would 
require artificial general intelligence, a generative dialogue system that is 
conditioned on previous text and that mimics collaborative writing could give 
the audience an illusion of sensible dialogue [114]. The need for generative 
dialogue and language models required shifting from the rule-based, 
deterministic learning systems of Pyggy to a deep neural network-based 
language model which could generate sentences word by word. There are benefits 
to developing our own neural language model. It allows us complete control over 
the training data, training process, input processing, response generation, and 
deployment of the system. These are critical for building systems for live 
performance, where we require the ability to run (and sometimes debug) a model 
live, on-stage. 

7 http://www.theshowstoppers.org/ 

Briefly, word-based neural language generation works as follows. Deep neu- 
ral network-based language models associate each word in an input vocabulary 
with a vector of real valued numbers, or word vectors. These word vectors can 
be initialized to random numbers before training, or they can be initialized 
with vectors which have been pretrained on previous tasks [228], [254]. These 
pretrained vectors contain some information about how each word is used in 
context, and how word information can be related in vector space. In vector 
space, relationships such as analogies or opposites can be modelled [228], 
[254]. By using these word vectors as the inputs, language modelling can retain 
some of this relationship information. These vectors are used as the input for 
recurrent network models. Additional details on these models are given in 
Section 2.2.3. 
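
To make the idea of relationships in vector space concrete, the sketch below 
solves a toy analogy with vector arithmetic and cosine similarity. The 
two-dimensional vectors are hand-made for illustration and are not values from 
any pretrained embedding; the names toy_vectors, cosine, and analogy are 
likewise assumptions of this example. 

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hand-made toy vectors (NOT pretrained values).
# Axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "stage": [-0.5, 0.2],
}


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via the offset vector(b) - vector(a) + vector(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


answer = analogy("man", "king", "woman", toy_vectors)
```

With real pretrained vectors the same lookup runs over hundreds of dimensions 
and a much larger vocabulary, and the nearest neighbour is only approximately 
the expected word. 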

A.L.Ex. (Artificial Language Experiment) is built using recurrent neural 
networks (RNN) with long short-term memory (LSTM) [124], [227] as 
described in Section 2.2.4 and shown in Figure 3.3. In contrast to similar work 
in text generation [101], [317], we decided to use word-based RNNs. This 
facilitates curating the vocabulary produced by the dialogue system by 
replacing or removing offensive words generated by the LSTM. 

We experimented with multiple LSTM architectures with the goal of building 
a dialogue model that can handle the topics within an improvised scene over 
dozens of exchanges between the human and the AI. All models were trained 
by optimizing for perplexity. Recall that perplexity relates to the likelihood 
of sampling a sentence with a language model trained on a particular corpus 
(Equation 2.30). We started from a first version consisting of 100,000 linear 
input word embeddings and a two-layer LSTM with 256 hidden units, followed 
by a softmax over 100,000 output words. This model had many missing input 
words and created dull responses with no notion of topic. The second version 
improved consistency by including topic vectors. It contained four LSTM layers 
of 512 hidden units each, with 64 extra inputs to the first LSTM layer coming 
from a Latent Dirichlet Allocation [25] topic model (described in Appendix C). 
This design enables the language model to integrate long-range dependencies in 
the generated text and capture the general theme of the dialogue [231]. Design 
choices followed details from Mikolov and Zweig (2012) [229]. 

In the second version there were still many unused output words. The top 
50,000 words account for 99.15% of all words in the input data. The third 
version used pre-trained word embeddings (GloVe) [254] as inputs, resulting 
in a larger vocabulary of 250,000 input words and only 50,000 output words; 
the GloVe word embedding matrix was treated as pre-trained and stayed fixed 
during training. With fixed embeddings and a smaller output space, 
convergence was sped up. The fourth version cloned the 4-layer LSTM into 
a query embedding module and a response generating module in a SEQ2SEQ 
architecture [156], [318] with an attention model over the query embedding 
vectors [302]. With more trainable parameters, the fourth model increased 
the size of the stored model, as well as the loading time. Because it gave only 
a small performance boost, it was decided that the third model would be the 
optimal trade-off of model size and response performance. This was particularly 
important for the speed required for live performance, described in detail in 
Section 4.6.4. 

A.L.Ex. is a neural language model which uses continuous word embeddings
to predict sequences of words. The neural network-based language model is
constructed as described above and trained as a probabilistic classifier to
predict a probability distribution over a vocabulary of words given an input
linguistic context. This is done through stochastic gradient descent using
backpropagation. Adaptive moment estimation (Adam) is used as the
optimizer [155] with an initial learning rate of 0.0002, multiplied by 0.5 every
epoch (one epoch is a single pass through the entire set of training data).
Gradients are clipped at magnitude 5, and a dropout value of 0.2 is used after
each RNN layer for regularization. The model was trained to convergence over 10
epochs. Sampling is done using a beam search with softmax sampling temperature
0.8 and a beam width of 10, stepping forward through a sequence of words
(300-dimensional embeddings) from the model and continuing prediction until a
maximum sequence length (100) or the end-of-sequence token is encountered.
Utterances are returned alongside their resulting perplexity under the given
language model. A heuristic selects the sentence with the lowest perplexity,
with a discount factor based on the length of the sentence to encourage shorter
sentences. Models were trained using a GPU back end (Amazon AWS EC2
g2.2xlarge); training time was 0.818 s per batch of 10 sequences of length 100.
Training to convergence took just under 1 month. Inference is done using a CPU
back end. For serving the trained model, the minimum configuration requires
8 GB RAM, 4 VPU/CPU, and 20 GB of hard disk space.
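
The selection heuristic described above can be sketched as follows. The exact form of the length discount is not specified in the text, so the multiplicative penalty used here, the `length_penalty` value, and all function names are assumptions made for illustration only.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))


def select_response(candidates, length_penalty=0.05):
    """Pick the candidate with the lowest length-penalized perplexity.

    candidates: list of (tokens, log_probs) pairs, e.g. the output of a beam
    search. length_penalty is a hypothetical weight: each additional token
    inflates the score, which favours shorter sentences.
    """
    def score(candidate):
        tokens, log_probs = candidate
        return perplexity(log_probs) * (1.0 + length_penalty * len(tokens))

    return min(candidates, key=score)


# Toy beam: two equally likely sentences of different length, and one short
# but unlikely sentence.
beam = [
    (["i", "miss", "you"], [math.log(0.2)] * 3),
    (["i", "miss", "your", "vivid", "sense", "of", "humour"], [math.log(0.2)] * 7),
    (["what"], [math.log(0.05)]),
]
best_tokens, _ = select_response(beam)
```

With these toy values the two longer candidates share the same perplexity, so the penalty breaks the tie in favour of the three-word sentence, while the one-word candidate loses because its perplexity is much higher.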


Dataset 


The language model of A.L.Ex. was trained on transcribed subtitles from
102,916 movies from OpenSubtitles.org, going from 1902 to early 2016.⁸ This
user-contributed subtitles dataset for dialogue model training contains multi- 
ple languages and versions for each movie [340]. The data were available as 
XML files, with precise timestamps for each line of dialogue. We kept one 
English subtitle version per movie. As we noticed that subtitles tend to be 
split over time and that each change of interlocutor is marked by a dash sign, 
we processed the XML files to adjoin lines of dialogue separated by 1 sec, 
starting with lower-cased words and without an initial dash, into single lines 
of dialogue. Further processing involved correcting common spelling mistakes
to account for the often erroneous subtitle input (e.g., substitutions of “l”
by “I” or vice-versa, extra spaces between an apostrophe and the contracted
word, or repetitions of letters, using a painstakingly hand-crafted set of over
one thousand regular expressions) and removing information such as
“subtitles by ...”. The resulting files were lower-cased. After text clean-up, we
calculated that the top 50,000 words accounted for about 99.4% of the total
words appearing in the corpus. The resulting corpus contained approximately
880 million tokens (including dashes).

⁸ http://opus.lingfil.uu.se/OpenSubtitles2016.php
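
The merging and clean-up steps described above can be illustrated with a small sketch. It is not the processing code used for A.L.Ex.: the three regular expressions stand in for the hand-crafted rule set, the `(start, end, text)` tuple format is an assumed simplification of the XML timestamps, and the reading of the one-second rule as a maximum gap between consecutive lines is an assumption.

```python
import re

# Illustrative clean-up rules only; the actual rule set is not reproduced here.
CLEANUP_RULES = [
    # Remove the space before a contraction: "don 't" -> "don't".
    (re.compile(r"\s+'\s*(s|t|m|d|ll|re|ve)\b"), r"'\1"),
    # Crudely collapse a letter repeated three or more times: "knoooow" -> "know".
    (re.compile(r"([a-z])\1{2,}", re.IGNORECASE), r"\1"),
    # Collapse runs of whitespace into a single space.
    (re.compile(r"\s{2,}"), " "),
]


def merge_subtitles(entries, max_gap=1.0):
    """Join subtitle fragments that continue the previous line.

    entries: list of (start_sec, end_sec, text) tuples in chronological order.
    A fragment is appended to the previous line when it starts within max_gap
    seconds of the previous end, begins with a lower-case letter, and has no
    leading dash (a dash marks a change of interlocutor).
    """
    merged = []
    prev_end = None
    for start, end, text in entries:
        text = text.strip()
        continues = (
            bool(merged)
            and prev_end is not None
            and start - prev_end <= max_gap
            and text[:1].islower()
            and not text.startswith("-")
        )
        if continues:
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
        prev_end = end
    return merged


def clean_line(text):
    """Apply every clean-up rule in order, then lower-case the result."""
    for pattern, replacement in CLEANUP_RULES:
        text = pattern.sub(replacement, text)
    return text.strip().lower()


entries = [
    (0.0, 1.5, "- Captain, the ship"),
    (1.8, 3.0, "is under attack."),
    (6.0, 7.0, "- I don 't knoooow."),
]
merged = merge_subtitles(entries)
cleaned = [clean_line(line) for line in merged]
```

In this toy input the second fragment is joined to the first because it follows closely, starts in lower case, and has no dash, while the third stays separate as a new speaker turn.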

Data preparation followed this procedure:

1. Download, parse, and clean OpenSubtitles.

2. Build and extract a vocabulary covering the majority of the words in the
dataset.

3. Find 68 sound effect keywords and replace them with special tokens
associated with synthesizable sounds (e.g. laugh, yell, whistle, music, phone,
and catch-all sound effects).

4. Match vocabulary to word embeddings.

5. Remove movie scripts with low embedding coverage.

6. Use Bag-of-Words for LDA topic clustering.

7. Combine cleaned lines and per-movie topic vectors.

8. Save data files in fast access data storage ready for training (e.g.
HDF5/JSON).
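
The vocabulary step of this procedure, and the coverage figures quoted earlier (for example, the top 50,000 words accounting for about 99.4% of the corpus), rest on a simple frequency count. The sketch below shows one way such a coverage figure could be computed; the function name, whitespace tokenization, and two-line toy corpus are assumptions for illustration.

```python
from collections import Counter


def vocabulary_coverage(lines, vocab_size):
    """Return (vocab, coverage) for the vocab_size most frequent tokens.

    coverage is the fraction of all token occurrences in the corpus that
    belong to the selected vocabulary.
    """
    counts = Counter(token for line in lines for token in line.split())
    total = sum(counts.values())
    most_common = counts.most_common(vocab_size)
    covered = sum(count for _, count in most_common)
    vocab = [token for token, _ in most_common]
    return vocab, (covered / total if total else 0.0)


corpus = ["the ship is under attack", "the captain is on the deck"]
vocab, coverage = vocabulary_coverage(corpus, vocab_size=2)
```

On the toy corpus the two most frequent tokens cover 5 of the 11 token occurrences. Applied to a full corpus, the same count shows how quickly coverage saturates as the vocabulary grows, which motivates a reduced output vocabulary.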

The choice of a movie dialogue corpus, derived from movie scripts, is fitting. 
Often improv comedy actors draw on previous experience, personal culture and 
practice in their spontaneous creative process [206]. Future work might explore
a variety of text-based data sources including plays, short stories, transcripts
of improvised performances, and symbolic plot points [59].


System Architecture 


A.L.Ex. was designed to do away with the multiplicity of connected services
which formed the architecture of Pyggy. A.L.Ex. aimed to be an offline,
standalone artificial improvisor. As with Pyggy, speech recognition and
generation are still performed by ready-made tools, respectively Apple Enhanced
Dictation⁹ and Apple Speech Synthesis, but these tools run on the same computer
that serves the model, without depending on an internet connection.

Figure 3.4: User interface with example input (top) and response (bottom).

The entire text-based dialogue system (coded in Lua and Torch¹⁰) was
encapsulated into a single program which makes system calls to speech
recognition and text-to-speech, and was controlled through a graphical user
interface (Figure 3.4) which visualizes results (i.e. both the recognized and
generated sentences in the dialogue). The core system is extended with
additional modules; it also runs a fault-resilient server which accepts
incoming HTTP/GET requests from client applications. These applications include
software controlling a humanoid robot with pre-programmed motions that are
activated when A.L.Ex. speaks (see Figure 3.6). Applications have been written
for controlling both the EZ-Robot JD Humanoid¹¹ and the SoftBank Aldebaran
Nao.¹²
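
To make the client-server arrangement concrete, here is a minimal sketch of a server that answers HTTP GET requests with a generated line and keeps running when generation fails. The actual system was written in Lua and Torch; this Python version, the `/respond` endpoint, the `text` query parameter, and the placeholder `generate_response` function are all hypothetical and only illustrate the pattern.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def generate_response(utterance):
    """Placeholder for the dialogue model; a real system would sample a reply."""
    return "yes, and " + utterance


def handle_get(path):
    """Map a request path such as /respond?text=hello to (status, JSON body)."""
    parsed = urlparse(path)
    if parsed.path != "/respond":
        return 404, json.dumps({"error": "unknown endpoint"})
    text = parse_qs(parsed.query).get("text", [""])[0]
    try:
        reply = generate_response(text)
    except Exception as exc:
        # Report the failure to the client but keep the server alive.
        return 500, json.dumps({"error": str(exc)})
    return 200, json.dumps({"input": text, "response": reply})


class DialogueHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = handle_get(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Serve requests until interrupted; a robot client would poll this server."""
    HTTPServer(("localhost", port), DialogueHandler).serve_forever()
```

Keeping the request logic in a plain function (`handle_get`) separates it from the network layer, so a client such as robot control software only needs to issue a GET request and parse the JSON reply.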


3.3 Results 


There are challenges associated with testing, and quantitatively evaluating,
open-domain dialogue systems [98], [114]. An obvious and reasonable first
measure for qualitative assessment would be similar to that of a human
improvisor: the audience-perceived performance level of an artificial
improvisor during an improvisational performance. Thus, each of these systems
has been tested live in front of audiences of between 5 and 100 people, for a
total of 25 performances, each 7 to 60 minutes long, between 8 April 2016 and
1 June 2017. As is common in improvisation, show structure and order remained
largely consistent, while content varied based on audience suggestion.¹³
Through audience feedback, the system has been iteratively improved through
enhancements to the neural network-based dialogue system, the addition of
performance props (e.g. audio and video user interface, robotic avatar), novel
improv comedy games involving the machine and the humans (with optional
audience participation), and scripted narrative elements performed by the
humans.

⁹ https://support.apple.com/en-ca/HT202584
¹⁰ http://torch.ch/
¹¹ EZ-Robot https://ez-robots.com
¹² https://ald.softbankrobotics.com/en

While human audience evaluation is realistic and comparable with human 
improvisor evaluation, there are limitations. First, while an audience is eval- 
uating the performance of the artificial improvisor in the scene, they may be 
biased by the fact that a machine is generating or performing these lines. Due 
to this bias, it might be worthwhile to compare human and machine-generated 
responses and scenes offline where the rater is blind to the source of the sides 
of the dialogue. This is often done in dialogue evaluation, but it is costly,
time-intensive, and subject to high variance across subjective human raters
[73], [188]. As well, the performance of these systems can be limited at
several points in the
dialogue. Speech recognition accuracy is not perfect, and environmental noise 
(e.g. cheering/laughing audience) can influence the quality of input audio for 
speech-to-text. Finally, because of the high variability of live theatrical perfor- 
mance, the performance spaces (e.g. dark pub basements, or bright conference 
venues), the narrative elements, the technical difficulties encountered during 
performance, and iterative experimentation, it is difficult for us to provide 
quantitative assessment of the artificial improvisor in this study. 

Even with these limitations, it is important to report on the successes and
failures thus far with qualitative examples, and leave subjective assessment
to the reader. Important future work will include large-scale qualitative
assessment from trained improvisors for qualities of improvisation and from the
general public.

¹³ https://youtube.com/watch?v=FjxBdMRAZ1E


3.3.1 Version 1: Pyggy 





Figure 3.5: Human performing on-stage with Pyggy. The avatar of the artifi- 
cial improvisor is projected behind the human performer. 


The first version of AI-based improv, Pyggy, performed in the show Artificial
Improvisation live in front of an audience on April 8, 2016, as part of
an invited performance during Rapid Fire Theatre’s 2016 Bonfire Festival.¹⁴
While technically demanding, it was generally well received by the audience
(Figure 3.5). Here is a short excerpt from a transcription of the performance:


HUMAN: We’re having a picnic! Would you like cheese and
bread?

PYGGY: Seen one human, you’ve seen them all.

HUMAN: Promise me, right now, we’re here listening to each
other.

PYGGY: For fun? Where are all your queer party friends?

HUMAN: Why don’t you have any feelings for me? That’s why
I had to invite you on a picnic. I wanted to get some
time to talk to each other.

PYGGY: I miss your vivid sense of humour.

¹⁴ http://rapidfiretheatre.com


3.3.2 Version 2: A.L.Ex. 


Figure 3.6: Human performing onstage with A.L.Ex., personified as a robot.
The user interface, showing speech recognition results and machine-generated
response, is projected behind the performers.


A.L.Ex. performed for the first time in front of a live audience on July 24,
2016, during an improv drop-in in London (Figure 3.6). It has performed more
than 24 times, on two continents, since. We provide below a short transcript
from an early performance with A.L.Ex. As can be seen, the dialogue system
manages, to some extent, to keep track of the general theme of the conversation
(i.e. dramatic, with words related to navigation and combat).


HUMAN: Captain, the ship is under attack. The frigate is
arriving...

A.L.EX: You see, I don’t feel the same way. I just don’t
want to stay in the position to get burnt.

HUMAN: We’re all going to get burned by a cannon fire...

A.L.EX: While you’re in there, tell me. When I get
dressed, I’ll go up there and see what happens.

HUMAN: Captain, you need to get on the deck, the crew is
expecting you to speak, to take leadership. We are in
the middle of a war and this is an invasion of foreign
ships. Captain, show us your leadership.

A.L.EX: We won’t be able to navigate without them as we
are, serving for us to act like a life of violence
and to help us die from the most powerful life of our
time.


The subsequent experiments detailed in this section were performed using
iteratively improved versions of A.L.Ex.


3.3.3 Human-AI Collaboration through Games 


With the base system in place, we have explored several alternative formats,
often called structures or games in improvisation [200], [275]. We explored a
variety of formats because the context within which the artificial improvisor
performs is critical to the audience’s perception of quality. In improvisation,
a major tenet is to make one’s fellow improvisors shine.¹⁵ We created a show
built on structures where both A.L.Ex. and human improvisors could shine.


3.3.4 Justification Game 


The most extreme case of enhancing the stature of a human improvisor arises
from games where the actor is confronted with ridiculously difficult challenges
that he or she successfully overcomes [147]. One such game is called Actor’s
Nightmare (or sometimes pick-a-line or lines from a hat). It consists of the
player intermittently picking a random line of dialogue (typically unrelated
to the current improvisation), reading it aloud, and seamlessly integrating
it into the scene [72].¹⁶ The humour generally arises from the improvisor’s
skill in justifying that line of dialogue or from the line being coincidentally
appropriate. We found that, because of the limitations of speech recognition
and of the dialogue system in A.L.Ex., many of the human-machine interactions
ended up following the paradigm of justification games.

¹⁵ http://jimmycarrane.com/making-partner-1look-good/


3.3.5 Multiple-choice Human-mediated Dialogue 


A multiple-choice game was the first format that we explored outside of the ba- 
sic structure of two improvisors engaging in a basic dialogue in a scenic setting. 
In this format, the system visually presented several candidate responses on a 
screen, but did not say any of the responses. Instead, an audience volunteer 
would select their preferred response and read it aloud. In this way, we were 
able to directly engage an audience member in the performance. This demoli- 
tion of the fourth wall is common in improvisation [285]. When the audience 
is invited to directly interact with the system on stage, an additional tension is 
introduced in the room: how will an untrained human react if A.L.Ex. offers 
multiple interesting candidates, and what if there are no interesting candidates 
generated? We observed that these games required the audience member to share
attention between the screen and the human improvisor, and that they could be
hampered by low energy or low audience engagement.


3.3.6 Multiple-person Games 


We explored dynamics where the AI played a single character in a scene with
multiple humans. First, we introduced multiple humans in the same physical
space. In this situation, A.L.Ex. plays alongside two human performers. We
noticed that there is often a tendency for the two humans to form a ‘side’,
acting together ‘against’ the system. Much more interesting scene dynamics
emerged when we challenged one of the human performers to align with
A.L.Ex.’s character in the scene. Extending from this work, we then tried
including the second human through a remote connection (Google Hangout,
see Figure 3.7). A.L.Ex. was able to interact with the physical human and the
remote human. High-resolution, low-latency video conferencing continues to
prove challenging, especially for remote performances with limited budgets,
and we continue to explore means by which to overcome these challenges. We
then instantiated multiple versions of A.L.Ex. in a single scene. In this way,
we could balance the two humans on stage with two robotic improvisors. This
presented opportunities for interesting connections and relationships as well
as challenges. The timing of multiple AI-based improvisors can be noticeably
different if not properly synchronized.

¹⁶ http://stuffforactors.com/gpage3.html

Figure 3.7: Two humans performing on-stage with A.L.Ex. One of the humans
is remotely connecting, adding to the complexity of the show setup.


3.3.7 Comparison with ELIZA 


Finally, we built an audience interaction game in homage to one of the
earliest chatbot systems, ELIZA, by Joseph Weizenbaum [344]. In this format,
an audience member is invited to the stage to discuss an ailment with an AI
therapist played by A.L.Ex. in ELIZA mode. Details on ELIZA are included
in Appendix B. Although ELIZA is powered by relatively simple deterministic
response rules applied to certain decompositions of the human’s input
statement, this game is an audience favourite and often well received during
shows. This observation deserves special attention, as the holistic performance
of an AI improvisor should be evaluated based not only on how well it is
received, but also on the novelty and uniqueness of the scenes it performs.
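
To illustrate what deterministic response rules over a decomposed input statement look like, the sketch below implements two toy rules in the spirit of ELIZA. It is not the ELIZA mode used in the show, nor Weizenbaum's original script: the patterns, templates, reflection table, and function names are assumptions chosen for brevity.

```python
import re

# Each rule pairs a decomposition pattern with a reassembly template.
RULES = [
    (re.compile(r".*\bi feel (.*)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r".*\bmy (.*)", re.IGNORECASE), "Your {0}?"),
]

# First-person words are swapped for second-person ones in the echoed fragment.
REFLECTIONS = {"my": "your", "me": "you", "i": "you", "am": "are"}

DEFAULT_REPLY = "Please go on."


def reflect(fragment):
    """Swap first-person words for second-person ones, word by word."""
    return " ".join(REFLECTIONS.get(word.lower(), word) for word in fragment.split())


def eliza_reply(statement):
    """Return the reply of the first rule whose pattern matches the statement."""
    statement = statement.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.match(statement)
        if match:
            return template.format(reflect(match.group(1)))
    return DEFAULT_REPLY


reply_feel = eliza_reply("I feel anxious about my show.")
reply_my = eliza_reply("My knee hurts")
reply_default = eliza_reply("Hello")
```

The same input always produces the same reply, which contrasts with the sampled responses of the neural language model and makes the game predictable enough for an audience volunteer to play along.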

Many games were selected to allow for clear, distinct trade-offs between
multiple improvisors within consistent settings. Often our systems fail through
misunderstood speech-to-text input or human-machine interruption due to a
lack of social cueing and perception. By embracing and learning from these
failures, we will continue to innovate and experiment to better understand and
showcase the strengths of A.L.Ex.


3.4 Discussion and Summary 


Future work might incorporate advances in deep reinforcement learning for di- 
alogue generation [273]. Through design of reward functions, more interesting 
dialogue may be encouraged. Three useful conversational properties shown to 
improve long-term success of dialogue training are: informativity, coherence, 
and ease of answering [179]. Additional reward schemes may improve, or tune,
the trained deep neural network-based dialogue managers. Previous work has
shown that reinforcement learning can be used to tune music generation
architectures [139]. Rewarding linguistic features (e.g. humour, novelty,
alliteration) may prove useful in dialogue generation [126].

This study focused on building a dialogue system for improvisational
performance. Improv theatre is a relatively young art-form, but there exist
several books of rules for novice improvisational training which could be
useful for future studies [241]. Future iterations of these systems could
include common improvisational rules, such as ‘status contrast’ and the ‘Yes,
and...’ theory of accept and expand, and comedic rules, such as the
‘rule-of-three’ [147].

Adversarial methods for natural language are another means of exploration
[180], [271]. While the results are interesting and informative, these
works are still limited in the objective functions and evaluation criteria
used, often relying on log-likelihood, BLEU [251], or ROUGE [183] scores.
Additional evaluation metrics must be devised to score these open-domain
dialogue systems [184], [188], [273].

Natural human conversations are seldom limited in scope, jump from topic 
to topic, and are laced with metaphor and subtext. Artificial improvisors of 
the future should make use of advances in artificial memory [348] and atten- 
tion models [196]. As well, humans often make use of non-verbal cues during 
dialogue. By incorporating this additional information, human(s) could both 
consciously and subconsciously inform the learning system [210]. Additionally, 
if the artificial improvisor is modelled as a goal-seeking agent, then shared 
agency could be quantified and communicative capacity could be learned and 
optimized for during the performance [262]. 

While the system is trained to perform dialogue, it is not trained to tell 
a cohesive story with a narrative arc. The addition of memory network ad- 
vancements may improve callback; additional engineering and training will 
be necessary to collaboratively build a narrative arc (as discussed in Chapter 
5). In 1928, William Cook published a book on algorithmic plot development 
which may serve this purpose, and implementations and connections have yet 
to be explored [59]. Preliminary investigation toward incorporating plot de- 
velopment into performative systems is presented in Chapter 6. 

There is a large research area devoted to computers and creativity (see Sec- 
tion 2.3.5) [218]. These advances are pushed to the horizon by artists around 
the world. These artists are building complex projects, using machine learning 
techniques to implement creative, imaginative ideas. Thought must be given 
to the interface through which humans and artificial performers interact [218], 
[219], [359]. The embodiment of the artificial improvisor has been investigated 
with Pyggy and A.L.Ex. using on-screen visualizations and robotics. Stage 
presence is critical to ensure that a live show is enjoyable and engaging. Im- 
provisational performances are not strictly conversational and often demand 
physicality from performers. The optimal means by which these systems take 
the stage has yet to be determined. 

Previous work has explored the importance of physical embodiments on social
dynamics for similar performances [157]. Collaboration between scientists
and creative artists will lead to innovative interactions and immersive art which
ignites the senses. With the growing popularity of interactive mixed-reality
experiences, as well as advances in natural language processing, speech, and
music generation, there are exciting avenues of future investigation [8], [332].

Improvisational theatre is a domain where experimentation is encouraged, 
where interaction is paramount, and where failure flourishes. It allows ar- 
tificial intelligence agents to be effectively tested, and audience reaction can 
provide a subjective measure of improvement and cognizance. While this work 
may feel similar to the Turing test, an early attempt to separate mind from 
machine through a game of imitation, deception and fraud, it is much more 
than that [330]. Success will be measured by audience preference to engage in 
shows incorporating artificial improvisation and human desire to participate. 
Turing imagined a world where machinery might exceed human performance 
in some domains. It is ironic and poetically fitting that ‘live’ performance 
itself may be one of these domains. 

Board games such as chess and Go are complex, but computational solutions
can be approximated. Improvisational theatre demands creativity, rapid
artistic generation, and natural language processing. Improvisation is not a
zero-sum game, especially as these systems learn to converse in open-domain
settings [98], [114]. Future work will continue to explore the evaluation of
performance in such an open domain. Performances with artificial improvisors
continue to spur questions and insights from other performers and audiences
alike. A formal evaluation, with audiences observing expert improvisors
interacting with the system, is presented in the next chapter. We look forward
to the distant goal of the human observer, as a fly on the wall, watching
artificial improvisors on-stage in front of a full audience of artificial
observers. We strive to continue constructive dialogue around creative AI by
connecting the sciences and the arts.
Chapter 4


Improbotics: Exploring the 
Imitation Game 


Most of the programmes which we can put into the machine will result in its
doing something that we cannot make sense of at all, or which we regard as
completely random behaviour. Intelligent behaviour presumably consists in a
departure from the completely disciplined behaviour involved in computation,
but a rather slight one, which does not give rise to random behaviour, or to
pointless repetitive loops.


— Alan Turing, 1950 


Theatrical improvisation (or improv) is a demanding form of live, collab- 
orative performance built on open-ended narrative structures in ephemeral, 
imagined worlds. It is thus an ideal test bed for the development and de- 
ployment of interactive artificial intelligence (AI)-based conversational agents, 
or artificial improvisors. This chapter introduces an improv show experiment 
featuring and comparing both human actors and artificial improvisors. In 
the previous chapter we developed a deep-learning-based artificial improvisor,
trained on movie subtitles, that can generate plausible, context-based lines of
dialogue suitable for theatre [212].

A version of this chapter was accepted for publication as: Mathewson KW and
Mirowski P. (2018) Improbotics: Exploring the Imitation Game using Machine Intelligence
in Improvised Theatre. AAAI Conference on Artificial Intelligence and Interactive Digital
Entertainment. Portions of this chapter are adapted from Mathewson KW and Mirowski
P. (2017) Improvised Theatre as a Turing Test. NIPS 2017 Workshop on Machine Learning
for Creativity and Design.

We extend that work by employing that system to control what a subset
of human actors say during an improv performance. We also give human-generated
lines to a different subset of performers. All lines are delivered to the
actors through headphones, and all performers wear headphones. This
chapter describes a Turing test, or imitation game, taking place in a theatre,
with both the audience members and the performers left to guess who is a
human and who is a machine. In order to test scientific hypotheses about the
perception of humans versus machines, we collect anonymous feedback from
volunteer performers and audience members.

Over the course of 30 performances to a combined audience of almost 3000 
people, we have refined theatrical games which involve combinations of human 
and (at times, adversarial) artificial improvisors. We have developed specific 
scene structures to include audience participants. Finally, we developed a 
complete show structure which submits the audience to an experience inspired 
by the Turing test and observed their suspension of disbelief, which is key 
for human/non-human theatre co-creation. Our results suggest that rehearsal
increases proficiency and the ability to control events in the performance. That
said, consistency with real-world experience is still limited by the interface
and the mechanisms used to perform the show. We also show that human-generated
lines are shorter and more positive, and that they contain fewer difficult words
and more grammar and spelling mistakes than the lines generated by the
artificial improvisor.


4.1 Introduction 


Improvisation (or improv) is a complex theatrical art-form modelled on natural 
human interaction and demanding constant adaptation to an evolving context. 
It has been defined as “real-time dynamic problem solving” [200]. Improv 
requires performers to exhibit acute listening to both verbal and non-verbal
suggestions coming from the other improvisors, split-second reaction, rapid
empathy towards the other performers and the audience, short- and long-term
memory of narrative elements, and practised storytelling skills [147]. From an
audience point of view, improvisors must express convincing raw emotions and
act physically.

Figure 4.1: Illustration of two Improbotics rehearsals.

Improvisational computational storytelling is a grand challenge in artificial 
intelligence (AI) as proposed in previous research [206], and formalized in 
Chapter 3. While success on the grand challenge might be contingent on 
solving open-domain conversational general artificial intelligence, there have 
been incremental scientific steps made progressing toward a unified system 
which can engage in improvised theatre in an open world [41], [105], [212], [364]. 
While these systems do not fully understand the interaction, they can, in spite
of (or perhaps, as an improvisor would think, thanks to) their imperfections,
fuel the creativity of the performers.


4.2 Background 


Theatrical improvisation is a form of live theatre where artists perform “real- 
time dynamic problem solving” through semi-structured spontaneous story- 
telling [200]. Improvised comedy involves both performers and audience mem- 
bers in interactive formats (sometimes called scenes, structures, or skits). We 
present explorations in a theatrical Turing test as part of an improvised comedy 
show. We have developed an artificial intelligence-based improvisational
theatre actor. This artificial improvisor is a chatbot with speech recognition
and speech synthesis capabilities. It also uses a physical embodiment [112],
[256]. We have performed alongside the system in improv shows at performing
arts festivals, including ImproFest UK and the Brighton, Camden, and
Edinburgh Fringe Festivals [212]. Complete show listings and recordings are
available at https://humanmachine.live. Public reactions to these performances
are included in Section 7.2.

Over these first 30 shows, one or two humans performed improvised scenes 
with the artificial improvisor. The performers strove to endow the system with 
human qualities of character/personality, relationship, status, emotion, per- 
spective, and intelligence, according to common rules of improvisation [147], 
[241]. Relying on a large corpus of film subtitles and custom neural network
software for language understanding and text generation, described in detail
in Chapter 3, we were able to produce context-dependent replies for the
machine-based improvisor. A high-level system diagram is shown in Figure 4.2.


Figure 4.2: Artificial Language Experiment (A.L.Ex.) System Diagram


The system we developed aims to maintain the illusion of intelligent
dialogue. Improvised scenes develop emotional connections between imaginary
characters played by improvisors. The human-like characterization of the
machine elicited attachment from audience members. Through various
configurations (e.g. human-human, human-AI, and AI-AI) and different AI
embodiments (e.g. voice alone, visual avatar, or robot), we challenged the
audience to discriminate between human- and machine-led improvisation. In one
particular game setup, through a Wizard-of-Oz illusion, we performed a Turing
test-inspired structure [18], [278].

We deceived the audience into believing that an AI was performing, then
we asked them to compare that performance with a performance by an actual
AI. Below we present details on how we debuted this technology to audiences, and
provide observations collected over multiple performances. Feedback from the
audience, and from performers who have experimented with our system, provides
insight for the future development of improv games and into the suspension of
disbelief required for non-human theatre.


4.2.1 Related Work 


Research on computational improvisation often focuses on music and dance, 
and on how humans interact and co-create with artificial systems [84], [125], 
[326]. Improvised theatre has also been a platform for digital storytelling and 
video game research for more than two decades [112], [256]. Theoreticians 
and practitioners have experimented with several rule- or knowledge-based 
methods for collaborative storytelling and digital improvisation [199], [246], 
[310], [363], and computer-aided interactive storytelling has been explored in 
video game development, aiming to create near-infinite narrative possibilities 
to drive longer-term player engagement [275]. There have been previous
installation art pieces which incorporated computer simulation and interactive
improvisation.¹ To the best of our knowledge, our case study describes the
first application of deep learning-based conversational agents [340] to control
and guide unstructured improvised theatre performance of human actors.
Robotic performances have been explored previously [33] and are discussed
in Section 2.3.4. These shows required the audience to suspend disbelief and
embrace the mirage of autonomy. Robot-based performances challenge the
uncanny valley: the idea that as the appearance of a human-like robot
approaches a human likeness, human responses shift from empathy toward
revulsion [236]. Toy-like humanoid robots have been involved in improvised
theatre performances [200], for instance Arthur Simone’s “Bot Party: Improv
Comedy with Robots”² and “HumanMachine: Artificial Intelligence
Improvisation”.³ Unlike those shows, our performance does not employ robotic
avatars but sends the AI-generated dialogue to human embodiments.

¹ https://www.badnewsgame.com/overview/


4.2.2 Motivation 


Cinematic releases including “Ex Machina” [91], “Her” [148], and “Robot & 
Frank” [293] explore robots interacting with humans naturally in day-to-day 
life. We invite live audiences to consider such interactions in a theatrical
setting. Theatre practitioners can embrace machine learning systems as tools
to explore dramatic interactions and to expand the realm of stories that artists
can create. This aligns with our research goal of augmenting creative abili- 
ties of humans. To test the quality of this creative augmentation, we have 
developed a test-bed for theatrical co-creation which places humans directly 
alongside machines in an improvisational performance. 

In our show Improbotics, we explore how human performers could seam- 
lessly perform when a machine, or another human, provides their lines. The 
human and machine performers work together to create a single, grounded, 
narrative improvisation. We combine conceptual ideas from classic improvisa- 
tion and novel methods in machine learning and natural language processing. 
The show is inspired by the improvisation game Actor’s Nightmare [72], where
one of the performers reads lines from a play and the other performers
seamlessly justify these otherwise incongruous lines while progressing a
narrative. This game is modified to incorporate previous work on improvised
theatre alongside artificial intelligence. Specifically, this work builds on
the performances described in the previous chapters, HumanMachine:
Artificial Intelligence Improvisation, and Etan Muskat’s Yes, Android.⁴





² http://arthursimone.com/bot-party/
³ https://humanmachine.live/
⁴ https://baddogtheatre.com/yes-android/
This work explores Wizard-of-Oz [18] style experimental methods that have
been used extensively in previous human-robot interaction studies and
dialogue system research [74], [87], [208], [278]. Wizard-of-Oz style
interactions with artificial intelligence controllers have been used to
provide suggestions to actors in previous artistic works.⁵ In these studies,
humans receive inputs from an external source. The source may be another
human, or the machine learning system. Importantly, the source is unknown to
the human. This allows for separation between the human subjects’ outputs
and the corresponding inputs. Similar to Actor’s Nightmare, the controlled
humans in Improbotics will say and justify the lines they are prescribed
through emotion, intonation, and physicality. What sets this format apart
from previous work is that in Improbotics the lines depend on the context of
the improvised scene. Improvisors who are not fed lines work to justify
them, as the lines are not completely congruous. These justifications aim to
make the scene look and feel more natural [13], [61].

In a way, Improbotics can be seen as a theatrical Turing test [211], [330].
Can the performers and audience discern those delivering lines generated by
a human from those delivering lines generated by a machine? We now cover
methods to test this question.


4.3 Methods 


4.3.1 Architecture of the Artificial Improvisor 


We named our improvisor A.L.Ex., the Artificial Language Experiment, an
homage to Alex the Parrot, trained to communicate using a vocabulary of 150
words [255]. The core of A.L.Ex. consists of a text-based chatbot implemented
as a word-level sequence-to-sequence recurrent neural network (4-layer LSTM
encoder, similar decoder, and 64-dimensional topic model inputs) with an
output vocabulary of 50k words. The network was trained on cleaned and
filtered subtitles from about 100k films, collected from OpenSubtitles.⁶
Additional model details are included in Chapter 3.

⁵ https://www.badnewsgame.com/

Dialogue turn-taking, timing, candidate sentence selection, and sentiment 
analysis [132] on the input sentences are based on heuristics. The chatbot com- 
municates with performers through out-of-the-box speech recognition (Google 
Speech Recognition) and text-to-speech software (Apple Speech Synthesis). 
The chatbot runs on a local web server for modularity and allows for
integration with a variety of physical embodiments (e.g. parallel control of
a humanoid robot). The robot shown in Fig. 4.2 was manufactured by EZ-Robot.⁷
The parallel integration allows for modularity in the robotic embodiment; we
have explored a variety of robots (including the Nao robot from Aldebaran
SoftBank) and visual projections as illustrated in Fig. 4.3(a). Physical robotic
embodiments allow for synchronized speaking and movement. This synchro- 
nization is important for explicit cueing interaction with untrained improvi- 
sors, such as audience members who join the performers onstage. 

The first version of our improvising artificial stage companions had their 
stage presence reduced to projected video and amplified sound (see Section 
3.2.1). We evolved to physical embodiments (i.e. the humanoid robot) to
focus the attention of the performer(s) and audience on a material avatar (see
Section 3.3.2). Our robotic performers are distinctly non-human in size, shape, 
material, actuation and lighting. We chose humanoid robotics because the 
more realistic an embodiment is the more comfortable humans often are with 
it; though comfort sharply drops when creatures have human-like qualities but 
are distinctly non-human [235]. The physical robot also enabled us to integrate 
well developed practices from puppetry and ventriloquism when developing 
and performing the improvised shows. These principles include looking directly 
at the puppet when it is speaking, controlling the micro-movements of the 
puppet, and showing the direct link between the human and the puppet [4]. 

The server also enables remote connection which can override the chatbot
and give dialogue control to a human operator. Multiple human performers
use headset or hand-held microphones, connected through a mixing table
to the audio input for speech recognition. Thus, several human performers
can perform simultaneously in the scenes. Having multiple performers, both
human- and machine-based, enables us to explore complex status dynamics
and 2-vs-1 relationships.

⁶ https://opensubtitles.org
⁷ https://www.ez-robot.com







Figure 4.3: (a, left) Visual and (b, right) physical embodiments of the artificial 
improvisor. 


Improbotics is a show structure created to explore the grand challenge of 
artificial improvisation [206]. The show is composed of a cast of trained human 
performers (semi-professional improvisors with at least 2 years of experience). 

The cast is broken down into four roles: Cyborgs, Puppets, Free-will
Humans, and Controllers.

1. Cyborgs are humans who take lines via headphones from an AI-powered
   chatbot overseen by a CEO Controller;
2. Puppets take their lines via headphones from a Puppet Master Controller;
3. Free-will Humans are free to make up their own lines of dialogue and
   typically support the show’s narrative; and
4. Controllers, of which there are two sub-roles:
   (a) the Puppet Master, who directly inputs lines for the Puppet; and
   (b) the CEO, who inputs scene context into an AI system that generates
       lines of dialogue for the Cyborg.
Figure 4.4: Two human performers and an audience volunteer improvising
with a robotic artificial improvisor.


4.3.2 Typical Structure of AI-based Improvised Shows


The artificial improvisors we developed have performed alongside human actors 
in 30 improvised shows to date. Version 1 is named Pyggy, and Version 2 is 
named A.L.Ex. The systems are described in detail in Chapter 3. Because 
the shows are improvised, every one is different. That said, there are several 
commonalities between the scenes which compose the show. 

An improvisational scene starts by soliciting a suggestion for context from
the audience (e.g., “non-geographical location” or “advice a grandparent might 
give”). The human performer then says several lines of dialogue to prime 
A.L.Ex. with dense context. The scene continues through alternating lines 
of dialogue. Often through human justification, performers aim to maintain 
scene reality and ground narrative in believable storytelling. A typical scene 
lasts between 3-6 minutes, and is interrupted by the human performer when 
it reaches a natural ending (e.g. narrative conclusion or comical high point). 

For instance, one improv game which was played in multiple shows was
staged by setting a volunteer in a psychotherapy session with the artificial
improvisor running the code of ELIZA [344]. This performance was presented
as an homage to the original chatbot creator, Joseph Weizenbaum.

Improbotics is structured as a collection of improvised scenes. A scene 
starts by soliciting a suggestion for context from the audience [147]. This pro- 
vides performers with a novel context around which to situate the improvised 
performance, and primes the AI system.

The scene consists of alternating lines of dialogue, where the Free-will
Humans provide dense context to the system (human or AI-based dialogue
model), and the Cyborg or Puppet performers respond in return. The Cyborg
and Puppet performers aim to maintain the reality of the scene and to ground
narratives in believable storytelling by justifying their lines emotionally
and physically. A typical scene lasts between 3 and 6 minutes, and is concluded
by the human performers when it reaches a natural ending. The performance 
progresses over the course of 20-45 minutes. At the end of the show, the au- 
dience votes to guess who was a Cyborg, who was a Puppet, and who was a 
Free-will Human. 

Our Turing test is relatively easy to solve by an attentive audience, and
similar imitation games have been explored previously [61]. We instead use
the Turing test framing both to draw audience engagement and to give a
creative constraint to the performers, analyzing the experience of performers
collaborating with interactive AI tools. Additionally, it is hard to evaluate
the imitation game with a live audience because of the deception required
from each attendee in a controlled but public performance setting. For this
reason, we provide the Turing test as a framework for the show, though it is
unlikely that audience members were tricked for the duration of the show.
The audience can infer who is a Cyborg or Puppet based on typos (e.g., “We
are stuck in the dessert?... desert!”), spelling and grammar mistakes, lack
of contextual consistency, and ignored salient information or timing
constraints. We discuss these points in Section 4.6.

We considered a baseline show, Yes, Android, that differs from Improbotics
in three aspects: 1) it relies on the publicly available chatbot Cleverbot,⁸
which is a general small-talk conversational agent that is trained not on
movie dialogue but on user interaction, 2) there is no Puppet Master or
Puppet, and 3) no guessing takes place, as the audience knows in advance who
is the AI-controlled performer.

⁸ http://www.cleverbot.com/


4.3.3 Technical Configuration 


The technology that enables remote control of a human player consists of a
laptop computer connected to a radio transmitter, a radio receiver with
headphones worn by the controlled performer (Cyborg or Puppet), and a
program that allows a Controller to type either the Puppet’s lines, or context
sent to an AI-based chatbot that will in turn generate sentences for the
Cyborg to say (see Fig. 4.1). We used the chatbot from HumanMachine’s
A.L.Ex. (described in Chapter 3), whose architecture is a
sequence-to-sequence [318] recurrent neural network [124] trained on movie
subtitles [340]. Full details on the model and technical configuration are
excluded for brevity as they can be found in Section 3.2.2. Our model uses
beam search to generate, word by word, a set of 10 likely candidate responses
to a given input and scene context; the top 4 sentences (ranked by perplexity
under the language model) are selected and shown on the visual interface.
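The selection step can be illustrated with a minimal sketch. It assumes that
each beam-search candidate arrives with its per-token log-probabilities and
that lower perplexity ranks first; neither detail is stated above, and the
function names are hypothetical. Beam search itself is not implemented here.

```python
import math
from typing import List, Tuple


def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity is the exponential of the negative mean token log-probability."""
    if not token_log_probs:
        raise ValueError("a candidate needs at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def select_top_candidates(
    candidates: List[Tuple[str, List[float]]], k: int = 4
) -> List[str]:
    """Keep the k candidates with the lowest perplexity (assumed ranking order)."""
    ranked = sorted(candidates, key=lambda cand: perplexity(cand[1]))
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical beam output: (sentence, natural-log probability of each token).
    beam = [
        ("Do not cry.", [-0.1, -0.2]),
        ("I must keep my promise.", [-0.5, -0.5, -0.5]),
        ("You must have respect for me.", [-1.0]),
        ("I love you very much in this life.", [-2.0]),
        ("Why the dessert?", [-3.0, -3.0]),
    ]
    for sentence in select_top_candidates(beam):
        print(sentence)
```

Normalizing by length, as perplexity does, avoids favouring short candidates
simply because they accumulate fewer log-probability terms.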


[Figure 4.5 interface content. Input context: “I love you.” Candidate
responses: “I love you very much in this life.”; “Do not cry.”; “I must keep
my promise.”; “You must have respect for me.”]

Figure 4.5: Simplified interface with the input line of context in the top red
box and four generated candidate responses shown below in green.


The Puppet Master Controller interface is a webpage where the operator
types lines for the Puppet. Once a line is typed, it is converted to speech
using in-browser (Google Chrome) text-to-speech capability. The CEO
Controller’s interface for the Cyborg is a webpage that enables the operator
to type a line of context. Once typed, that line is sent to a server running
the A.L.Ex. software, which returns the top 4 likely responses. The
suggestions are displayed on the webpage, and the CEO Controller can select
as many as they want, or discard them, type new context, and wait for new
candidates. When a suggestion is selected, it is immediately converted to
audio via text-to-speech, and output to speakers. As we elaborate in
Section 4.6, this multiple-choice setup for the Cyborg suggestions enables
the CEO to moderate what lines are sent to the Cyborg’s headphones, and to
curate suggestions to help drive the narrative of the performance.

All performers wear headphones, and both the Puppet and the Cyborg 
receive their lines through the same computer-generated voice through a text- 
to-speech interface. This configuration allows for a hidden set-up where the 


audience, and other performers, do not know who is playing which role. 
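The two control paths described in this section can be summarized in a short
sketch. Everything named here (Earpiece, puppet_master_turn, ceo_turn, and
the stub chatbot) is hypothetical and stands in for the webpage, server,
text-to-speech, and radio components; it is an illustration of the data flow
under those assumptions, not the software used in the shows.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Earpiece:
    """Stand-in for text-to-speech plus the radio link to one performer."""
    role: str
    heard: List[str] = field(default_factory=list)

    def say(self, line: str) -> None:
        # A real system would synthesize audio; the sketch only records the line.
        self.heard.append(line)


def puppet_master_turn(typed_line: str, puppet: Earpiece) -> None:
    """The Puppet Master's typed line is sent directly to the Puppet."""
    puppet.say(typed_line)


def ceo_turn(
    context: str,
    chatbot: Callable[[str], List[str]],
    choose: Callable[[List[str]], List[str]],
    cyborg: Earpiece,
    max_suggestions: int = 4,
) -> List[str]:
    """The CEO types context, curates the suggestions, and forwards the chosen ones."""
    suggestions = chatbot(context)[:max_suggestions]
    chosen = choose(suggestions)  # may be empty if the CEO discards every suggestion
    for line in chosen:
        cyborg.say(line)
    return chosen


if __name__ == "__main__":
    puppet, cyborg = Earpiece("Puppet"), Earpiece("Cyborg")
    puppet_master_turn("Welcome to the lab.", puppet)

    def stub_chatbot(context: str) -> List[str]:
        # Placeholder for the dialogue model; returns fixed candidate lines.
        return ["Do not cry.", "I must keep my promise."]

    ceo_turn("I love you.", stub_chatbot, lambda s: s[:1], cyborg)
    print(puppet.heard, cyborg.heard)
```

Keeping the human selection step as an explicit function makes the CEO's
moderating role visible: nothing reaches the Cyborg unless it is chosen.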
Figure 4.6: Audience and Performer Presence Analysis over Cities. Comparable
systems were used and then analysed by a set of performers in four cities
(sample size shown in brackets): Yes, Android in Toronto (TOR, n = 4), Im- 
probotics in Stockholm (STO, n = 6), Improbotics in London (LON, n = 7), 
and Improbotics in Edmonton (EDM, n = 9). Additionally, audiences were 
surveyed and data is presented for each city: LON-AUD (n = 6), STO-AUD 
(n = 22) and EDM-AUD (n = 29). Data presented is the average opinion over 
respondents in each group, with the 95 percent confidence interval. 


4.3.4 Turing Test-based Theatrical Improvisation Games


Starting from the transatlantic improv performance on 31 March 2017, we have
designed the AI improv show around narratives on the Turing test, computer
hacking, scientific experimentation, friendship and loneliness, and played up
the contrast of the two human performers whose alter egos emphasize
contrasting facets of computer scientists.

The performances at the Camden and Edinburgh Fringe festivals involved a
Turing test-inspired scene conducted with the willing audience. We performed
the scene by first deceiving the audience into believing that A.L.Ex. was
performing independently (whereas the chatbot and the robot were controlled
by a human); then we performed a second scene with A.L.Ex. In game (1), we
explained the Turing test first, then performed the two scenes consecutively
and finally asked the audience to discriminate, through a vote, which scene was
machine-driven. In a different game (2), we performed the Wizard-of-Oz scene
and then immediately asked, in character and as part of the performance, if
the audience suspected that a human was in control of the chatbot. Feedback
from the audience provided us with insight into the suspension of disbelief
required for non-human theatre. We present observations below.


4.4 Observations 


We summarize here anecdotal observations from our performances. In game (1)
(when the audience must discriminate between two scenes viewed unbiased)
nearly everyone distinguished the machine-driven scene from the human-driven
scene. In game (2), approximately half the audience members believed that
the machine was performing independently alongside human improvisor(s).
When not forewarned about the Turing test, the audience members (of
diverse composition in age and gender) were convinced that the dialogue
system understood the details of the scene and responded immediately and
contextually. The propensity for this delusion is likely driven by several
factors: 1) the context within which they are viewing the deception, 2) the
lack of personal awareness of the current machine learning abilities, and
3) emotional connections with the scene.

Post-show discussions with audience members confirmed that when a
performer tells the audience that an AI is controlling the robot’s dialogue,
the audience members will trust this information. Being at an improvisational
show, they expect to suspend disbelief and use their imagination. Most of
them were also unaware of the capabilities and limitations of machine
learning systems. This highlights the responsibility of the scientific
research community to communicate progress effectively and to invite public
understanding of system capabilities. Finally, we observed that the
introduction of a humanoid robot, with a human-like voice, increased the
audience’s propensity to immerse themselves in the imaginative narrative
presented to them.


4.4.1 Evaluation 


What is the best way to test an improvised dialogue system? As discussed in
previous chapters, measuring the performance of human-machine interaction is
notoriously difficult and requires investigation from both human and machine
perspectives.

A commonly used method of evaluating interactive performance is to ad- 
dress participants and audience during the show and after-the-fact, investigat- 
ing experience through open questions, questionnaires or focus groups [353]. 
Our system was evaluated for humanness based on evaluation criteria for nat- 
ural language dialogue systems in previous work [2]. In that work, the authors 
discuss that the evaluation of dialogue systems should be based on comparison 
with interaction with real humans: this is precisely the environment we aimed 
to create with Improbotics. 

Post-show questionnaire questions were based on a subset of the Presence 
Questionnaire [353]. These questions were originally developed for a small 
audience interacting in virtual reality domains. Questions from the original 
questionnaire were reworded or excluded if they pertained to immersive experi- 
ences inconsistent with improvised theatre. The final questionnaires presented 
to the audience and performers measured the system on the following five 
categories: 1) possibility to act, 2) realism, 3) evaluation of performance, 4) 
quality of interface, and 5) possibility to examine the performance. In
addition to quantitative survey-based evaluation, we report qualitative
assessment comments. Exploring subjective opinions of performers provides us
with valuable feedback about human-machine co-creation. These results are
presented alongside a note that because we use opt-in audience surveys, there
are likely biases in the data. That is, those audience members who choose to
participate already form a self-selected subsample of the audience.
Participation in the data collection was optional. No personal identifying
information was collected.


Performers and audience gave informed consent, and the study was approved 


4.5 Results 
Figure 4.7: Performer Presence Analysis over Systems. Yes, Android
performers in Toronto (n = 4) used a different system than Improbotics performers
in Stockholm, London, and Edmonton (n = 22). This plot illustrates the 
comparison of analysis between the two different systems. Data presented is 
average opinion over respondents in group and 95 percent confidence interval. 


We present here quantitative and qualitative results from experimentation 
with the Improbotics system. We have deployed the experimental configura- 
tion to three locations: 1) Improbotics Stockholm, Sweden (STO, n = 6), 2) 
Improbotics London, England (LON, n = 7), and 3) Improbotics Edmonton, 
Canada (EDM, n = 9), where n is the number of respondents. In addition 
to these locations, we also provide comparative results from performers in 
Toronto, Canada who performed in Yes, Android (TOR, n = 4). We addi- 
tionally present data collected from audience members who attended a show 
in each respective city, denoted: LON-AUD (n = 6), STO-AUD (n = 22) and 
EDM-AUD (n = 29). While audience demographic data was not collected,
we infer that audiences in LON and STO were generally other improvising
performers and audiences in EDM were representative of theatre-going patrons.
Performer and audience data from multiple cities allows us to systematically
measure the consistency and reproducibility of the experience on the
evaluation metrics defined above.
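The figures in this chapter report a group average with a 95 percent
confidence interval. The method used to compute the interval and the response
scale are not stated here, so the following is only a sketch under an
assumption: a normal approximation around the sample mean. The scores in the
usage example are made up for illustration and are not survey data.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_ci(scores: Sequence[float], confidence: float = 0.95) -> Tuple[float, float]:
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two responses to estimate spread")
    mean = statistics.fmean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95 percent confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, z * sem


if __name__ == "__main__":
    hypothetical_scores = [1, 2, 3, 4, 5]  # illustrative only
    mean, half_width = mean_with_ci(hypothetical_scores)
    print(f"{mean:.2f} +/- {half_width:.2f}")
```

For groups as small as n = 4, a Student's t critical value would give a wider
and more conservative interval than this normal approximation.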


4.5.1 Quantitative Evaluation 


The questionnaire to the performers is presented below. For the audience 
questionnaire, the wording of the questions was modified to reference “the 
performers” instead of “you”. 

1. (possibility to act) How much were you able to control events in the 
performance? 

2. (realism) How much did your experiences with the system seem consis- 
tent with your real world experiences? 

3. (evaluation of performance) How proficient in interacting with the system 
did you feel at the end of the experience? 

4. (quality of interface) How much did the control devices interfere with 
the performance? 

5. (possibility to examine the performance) How well could you concentrate 
on the performance rather than on the mechanisms used to perform those 
tasks or activities? 

Overall, the actors were satisfied with the performance despite the limited 
realism of the setup (consistent between cities) and moderate interface interfer- 
ence. We note no significant difference between Improbotics and Yes, Android. 
Improvisors from LON, who had the most rehearsals and performance oppor- 
tunities with the system, rated its realism the lowest but their proficiency with 
it the highest, judging that the system interface did not interfere significantly 
with the performance. Improvisors from EDM, who had only one rehearsal, 
had the most trouble concentrating on the performance rather than on the in- 
terface. We infer that, with practice, the system interface interfered less with 
the performance and that practice increases proficiency. Audiences rated the
performers as having more control of the events during the performance than
the performers rated themselves.
Note that we do not split the responses from the performers of different
types (Cyborg, Puppet, Puppet Master, CEO) due to the collaborative nature 
of improv and to the necessity to provide a single consistent show including 
all improvisors on the stage. Additionally, we observed that if one performer 
is limited in any way, it can affect the entire scene. 

In addition to the questionnaire, we compare the utterance choices that the
two Controllers (Puppet Master and CEO) provide to the Puppet and the
Cyborg, respectively, with lines from a script and lines from human
improvisors. For this comparison, we selected several linguistic features
(e.g. syllables-per-word, words-per-sentence, proportion of difficult
words, VADER sentiment [131], and grammatical/spelling errors) indicating
the complexity of the provided sentences.
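Three of these features can be sketched with the standard library alone. The
sketch below is an assumption-laden illustration, not the analysis code used
for this chapter: syllables are counted with a rough vowel-group heuristic,
and a "difficult" word is assumed to be one with three or more syllables,
since no definition is given above. VADER sentiment and spelling or grammar
checks need external tools and are omitted.

```python
import re
from typing import Dict


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    w = word.lower()
    count = len(re.findall(r"[aeiouy]+", w))
    if w.endswith("e") and not w.endswith("le") and count > 1:
        count -= 1
    return max(1, count)


def line_features(line: str) -> Dict[str, float]:
    """Compute simple complexity features for one line of dialogue."""
    sentences = [s for s in re.split(r"[.!?]+", line) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", line)
    if not words:
        raise ValueError("line contains no words")
    syllables = [count_syllables(w) for w in words]
    # Assumption: a word is "difficult" if it has three or more syllables.
    difficult = sum(1 for s in syllables if s >= 3)
    return {
        "syllables_per_word": sum(syllables) / len(words),
        "words_per_sentence": len(words) / len(sentences),
        "difficult_word_ratio": difficult / len(words),
    }


if __name__ == "__main__":
    print(line_features("Do not cry. I must keep my promise."))
```

Averaging these per-line features over all lines from one source gives the
kind of per-source comparison discussed in this section.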

While evaluating the quality of a dialogue interaction is difficult, these
linguistic features can provide a measure of the information contained within
each of the lines composing a dialogue. For the comparative lexicographical
analysis we used a test set of lines from four different data sources. We
analyse $L_{\mathrm{puppet}} = 334$ lines from the Puppet Master and
$L_{\mathrm{cyborg}} = 2248$ lines generated by the dialogue system. We
compare these lines with $L_{\mathrm{human}} = 410$ lines from Free-will
Human performers in Improbotics shows and $L_{\mathrm{script}} = 1675$ lines
from two published scripts. The scripts that we include for the analysis are
stylistically diverse, and similar to scripts used for the Actor’s Nightmare
improvisation game. The two scripts we used are Tennessee Williams’ “A
Streetcar Named Desire” [351] and Hannah Patterson’s “Playing with
Grownups” [253].

As seen in Figure 4.8, when comparing what the Puppet Master typically types
to what is found in theatre scripts or what is generated by the AI, we observe
that the Puppet Master and Puppet improvise with shorter lines, with
considerably more grammatical or spelling mistakes (which can be explained by
the time pressure on the Puppet Master to give lines to the Puppet
improvisor) and with a slightly more positive VADER sentiment (likely due to
the training of improvisors, who are encouraged to “yes, and” by displaying
positive sentiments). These results support the conclusions that
human-generated lines are shorter when typed and longer when spoken. As well,
human lines are more positive, have fewer difficult words than scripts, and
have more grammar and spelling mistakes than the artificial improvisor
generated lines.
Figure 4.8: Comparative Lexicographical Analysis by Source. To compare
the differences in text-based features we analyse a set of sentences from four 
different sources: 1) the Puppet Master, 2) the dialogue system or AI, 3) two 
published scripts, and 4) human lines transcribed with speech recognition at 
a set of shows in Edmonton. Data presented is the average over each group, 
with the 95 percent confidence interval. 


4.5.2 Qualitative Evaluation 


In addition to the quantitative evaluation, we also asked performers to pro- 
vide feedback with the following prompt: How would you compare performing 
alongside the system as compared to performing alongside a human? 

The results from this question allow us to better understand the expecta- 
tions of the performers. Selected quotes from the professional improvisational 
performers who worked with the Improbotics system in a variety of roles are
presented below, grouped into themes. 

Theme 1: Improvising with the system is more work.

• The system throws up some real curve balls which makes it different to
  performing with a human.
• You, as a human, have to be on your toes to validate the sometimes fun
  and crazy things that the Cyborg says.
• You get many more surprises when working with the system, which is
  wonderful.
• It is actually quite similar to playing with a normal human [who] has a
  crazy x-factor approach to improv.
• The system gives more “out-of-the-blue” comments, and it does not feel
  like it is cooperating with me to make a “perfect” scene.
• It took the scenes to wild places but it certainly felt like the whole time
  you’re just trying to keep it on the rails.
• If the AI was a human player it would be a quite annoying co-improvisor,
  with approx 80% screwball lines and non sequiturs.
• ...it is a lot more work for me as a human to drive the scene, and that
  could be a bit lonely and cumbersome sometimes.

Theme 2: The system cannot tell complete stories.

• If you want to tell a story, humans tend to have to pick up the arc
  and carry it through, since the Cyborg rarely brings arguably important
  characters or plot items back.
• As long as the human performers provide the improv “platforms” then
  those being controlled simply get to have fun!

Theme 3: Forces you to be a better improvisor.

• It requires more skill and practice to be able to respond to it without being
  too beholden to it.
• It makes it very important to be open and accepting. Blocking or denying
  of any kind only makes the “uncanny valley” deeper and more obvious.
• ...you have to be more physical and [create] a reality which allows
  space for the “curve balls”, and the cyborg’s stunted performance, to
  make sense.
• ...you have to listen more, and drive the scene yourself, you treat your
  cyborg scene partner differently - you can’t rely on them completely.
• I found it essential that the humans operating the system give performers
  enough to say; once or twice I was on stage with no lines coming through!
  Object work becomes super important in this instance!

Theme 4: Like performing with a novice improvisor.

• It was like performing with a very new improvisor with strange impulses.
• It takes a different mind-set, like being aware a fellow performer is very
  nervous and unpredictable.
• Every improvisor is different. Any improvisor has strengths and
  weaknesses, just like the system. Another day another improvisor.


4.6 Discussion and Summary 
4.6.1 Reflections from Professional Improvisors 


More than 20 professional improvisors have worked with the system and pro- 
vided their experienced analysis and critiques which are summarized above. 
Their feedback largely fell into four thematic categories: 1) improvising with 
the system is more work, 2) the system cannot tell complete stories, 3) it forces 
you to be a better improvisor, and 4) it was like performing with a novice im- 
provisor. Of these themes, two are negative (1 and 2), and two are positive 
(3 and 4). While working with the system is more work, this is largely due
to the design of the system, which aims to augment the human’s performance. Currently,
the system is focused on dialogue and has no notion of a “full story.” Future 
work is needed to incorporate narrative representations into the system, and 
we present two studies to this end in Chapter 5 and Chapter 6. The feedback 
that draws parallels to performing with novice improvisors is reassuring, as 
the goal of the system is to challenge the notion that “...the best improvisors 


make everyone else look great.” 


4.6.2 Deception and Problem Solving 


Improbotics is a narrative improv show, where neither the audience, nor the 
improvisors, know who is a Free-will Human, who is a remotely controlled 
Puppet, and who is an AI-controlled Cyborg. The AI dialogue system is
controlled by the CEO Controller who follows the context of the scene and 
the narrative progression on stage, interactively producing the next line for 
the Cyborg performer. These lines are often nonsensical and add incongruity 
to the ongoing scene. The randomness of these lines was addressed directly in 
several of the participants’ responses. While the justification of these random 
offers provides fun, it can also be isolating and challenging for the human 
performers who feel they are “improvising with a beginner” and need to take 
care of the narrative progression.

The human Puppet Master Controller, who observes the improv scene from
a hidden place, and who feeds lines to the Puppet via the earpiece, is tasked 
with a very difficult challenge. They need to listen to the scene and simultane- 
ously type dialogue suitable for the next line. Alternatively, as we observed in 
several performances, the Puppet Master Controller can pretend to be AI-like 
and through playful deception (e.g. generating more nonsensical or discon- 
nected lines of dialogue), introduce a wild-card into the audience’s mind. 

We desire to push the imitation game as far as possible while creating 
an enjoyable performance. Thus, we encourage the improvisors to act in the 
most natural and intelligent way. They are expected to play to the full range 
of their emotions and physicality. That said, they are also aware of the conceit 
of the show and often they can introduce intrigue in the audience’s mind by 
pretending to be more AI-like, more robotic. Through this “double-bluff”
any performer can act as if they are the Puppet, or Cyborg. As anecdotal 
evidence, some audience members incorrectly thought that a Free-will Human 
was a Cyborg in two out of six Improbotics shows in London. 

Improbotics is a way to frame the task as an optimization problem for 
the performers where the main objective is producing an enjoyable theatrical 
performance. The secondary meta-objective, a nod to the permeable
audience-performer relationship in improvisation, is playfully deceiving the
audience.


4.6.3 Lack of Contextual Consistency 


Through the comparison of the performances of the Cyborg, of the Puppet 
and of the classic improv game Actor’s Nightmare, we see how differently 
performers handle two types of contextual consistencies in improvised narra- 
tives. The first is broad consistency in the general theme of the improv (e.g., 
domain-specific vocabulary used in the scene). The second is fine-grained con- 
sistency in the articulation of the story (e.g., relationships between characters, 
character stance or world view). 

In the game Actor’s Nightmare, where the improvisor reads consecutive 
lines for a given character, selected from a random play, those lines are
typically consistent among themselves, but disconnected from the general theme
of the scene. The fun of the game derives from seeing both actors strive to
justify the incongruity of juxtaposing, for instance, a classical drama with a
science fiction setting. When performing a Puppet, the performer is typically 
given lines from a trained human improvisor who listens to the context of the 
scene and types lines with both high-level thematic and fine-grained narra- 
tive consistency. Despite the best efforts of the Controller who curates the 
lines produced by the AI, the Cyborg typically gets inconsistent lines from the 
point of view of the narrative. With the topic model incorporated in A.L.Ex., 
some thematic consistency can be maintained [212]. So, the AI, when primed 
with words “ship” and “pirate”, might generate sentences about sea-faring 
and sword-fighting (for additional details see Section 3.3.2). Interestingly, this 
is the opposite of the Actor’s Nightmare, which lacks thematic consistency. 
Rather than just incorporating topic in the model, future iterations of the 
system could include additional context. For instance, models that re-inject
scene-specific content (i.e. character names, locations, relationships, noun and
verb phrases) into generated responses are currently being investigated. To this
end, methods for editing prototypes by retrieving and refining candidates are a
relevant area of active research [106], [347].


4.6.4 Handling Timing in Improvisation 


One of the most challenging technical hurdles for human-machine co-creation is 
that of timing. Verbal interaction is defined most notably by the rapid
exchange of turns of talking. Gaps between these turns are often
as short as 200 ms in natural human conversation. Latencies in language 
processing can be on the order of 600 ms [173]. This implies that humans are 
often taking turns talking based on predictions of the next line of dialogue from 
the others in the conversation. Given this extremely short latency expectation, 
there is often noticeable delay for the Puppet and/or Cyborg. 

Our current system has a median response time of more than 2 seconds 
with some responses taking up to 4 seconds. The timing of these is seldom 
below 1 second unless we queue additional responses to a single input and force
an interruption with a potentially out-of-context follow-up line. These timing
limitations are similar to Actor’s Nightmare, where one of the improvisors is
restricted to reading lines from a script. 

Luckily, such timing latencies can be smartly hidden by proficient improvisors
through emotional, non-verbal, and/or physical actions. In our previous work
with an automated and un-curated chatbot, improvisors would typically talk
over a naïve robot voice responding with bad timing [212]. This
happened significantly less often with the Cyborg or Puppet in Improbotics, 
because all the people waited their turn to speak. Moreover, Cyborgs had 
(and used) the opportunity to skip an irrelevant or outdated line. Future 
work might develop methods for better handling of timing and turn-taking. 
Poor timing is a give-away for any system imitating a human conversationalist
in interactive dialogue [11], [61].


4.6.5 Human Curation of Dialogue 


Currently the system is based on the natural language generation model 
trained on movie dialogue presented in Chapter 3. While film subtitles are 
not the optimal dataset for training a dialogue system for live improvisational 
theatre, we chose this corpus to train the system because it was publicly avail- 
able (unlike theatre and movie scripts), because it contained informal, realistic, 
conversational language and because improvisors typically draw their inspira- 
tion from movies and TV series. Given that many of the movies in the source 
material are from over half a century ago, there are strong biases in the train- 
ing material toward offensive or out-of-date references. That said, without a 
set of improvised dialogue transcripts, movie dialogue is the best large-scale
corpus available for training these models. There is a need for human
moderation and curation to ensure that the system is not immediately offensive.
The current system could be improved by including automated metrics for 
offensive language detection and removal [67]. 

Improbotics is focused on developing improvised dialogue in scenic impro- 
visation. While critical to human-machine theatrical co-creation, this is only 
a small component of a larger automated story generation system. Incorporation
of automatic plot generation techniques introduced nearly a century ago
in [59] could augment the system with directorial abilities and event-based
story generation [205], [206]. A preliminary step towards incorporating such 
plot generation techniques is presented in Chapter 6. 

In this chapter, we presented Improbotics, an improvised performance which
serves as a test-bed for human-machine theatrical co-creation and can be used
for improving computational dialogue-based systems for live performance. The
system allows for Turing test-inspired experimentation. By confronting humans
with the incongruity of machines sharing the stage with them, we can
both create new opportunities for comedy and explore approaches to
human-machine interaction. We presented results from three geographically unique
locations where the system is currently being used to perform for live
audiences. We compared the Improbotics neural network-based and movie
dialogue-trained system with the Yes, Android baseline system, which uses
an online, publicly accessible chat-bot. We presented quantitative analysis
evaluating the system in five categories: realism; possibility to act; quality of
interface; possibility to examine; and evaluation of performance. We presented
qualitative analysis from professional improvisational performers. While this
chapter focuses on improvisation, these paradigms and results are valuable for
research in other areas of human-machine physical and verbal interaction [169],
[210].

Chapter 5


Information-Theoretic 
Narrative Generation 


There is no reason why the simple shapes of stories can’t be fed into
computers, they are beautiful shapes... beautiful curves... Computers can now
play chess so I don’t understand why they can’t digest this very difficult curve
I am going to draw for you now.

- Kurt Vonnegut on the Shapes of Stories [341]1


In this chapter, we consider the challenge of designing an artificial agent
capable of interacting with humans in collaborative dialogue to produce creative,
engaging narrative. Collaborative dialogue is distinct from chit-chat in that it
is knowledge building: each utterance provides just enough information to add
specificity and reduce ambiguity without limiting the conversation [320]. We
use concepts from information theory to define a narrative arc function which
models dialogue progression. We demonstrate that this function can be used
to modulate a generative conversation model and make it produce more
interesting dialogues, compared to baseline outputs. We focus on two antithetical
modes of modulation: reveal and conceal. Empirically, we show how the narrative
arc function can model existing dialogues and shape conversation models
towards either mode. We conclude with quantitative evidence suggesting that
these modulated models provide interesting and engaging dialogue partners
for improvisational theatre performers.

1From his rejected master’s thesis; he considered it his greatest contribution.

A preprint of this chapter is published as Mathewson KW, Castro PS, Cherry C,
Foster G, Bellemare MG (2019) Shaping the Narrative Arc: An Information-Theoretic
Approach to Collaborative Dialogue. arXiv preprint arXiv:1901.11528.


5.1 Introduction 


Designing and building computational models that generate meaningful
dialogue for human interaction, in an interesting and engaging manner, is a
challenging open problem. As personal digital assistants increase in popu- 
larity, proper conversational capabilities may allow them to provide creative, 
playful, and helpful interactions. Conversational agents can be effective for 
health-care [21], by supporting cognitive-behavioural therapy for treating de- 
pression [85], helping patients with chronic pain [230], and supporting remi- 
niscence [245]. These applications require systems capable of understanding 
and collaboration. 

What makes some dialogues more interesting than others? Interesting 
collaborative dialogue constructs knowledge iteratively [320] and depends on 
each speaker bringing information to the conversation [288]. Interestingness is 
also subjective and difficult to directly optimize via numerical methods [176], 
[338]. 

Rule-based conversational models have existed for over 50 years [344]. 
These methods are limited by hand-tuning and engineering to predict and 
handle possible inputs. Generative language models maximize the likelihood 
of an utterance (e.g. a sentence or sequence of words) [101]. These models 
can predict the likelihood of an utterance by considering sentences as
sequences of words, sub-word units, characters, and/or tokens [296]. This
objective can result in generated sentences which are grammatically correct, and
bear a semantic relationship to the surrounding context, but lack global
consistency [189].

Our work generates interesting dialogue by using a narrative arc to
incrementally construct shared knowledge. A narrative arc defines evolving
qualities of emotion, tension, or topic over a story [23]. We draw inspiration from
improvised theatre, where actors collaborate in real time to develop narrative
based on thematic constraints [147]. Improvised theatre is a unique story- 
telling medium which relies on collaborative dialogue in which each utterance 
must carry significant information [320]. We appeal to the two golden rules of 
improvised dialogue, characteristic of interesting collaborative dialogue [147], 
[288]. Good dialogue should 1) accept (i.e. be consistent with the dialogue
thus far) and 2) reveal (i.e. progress the dialogue with new information).

In this work, we propose a new method to modulate a conversation model, 
which accepts input utterances by generating consistent and revealing re- 
sponses. Our approach combines a conversational model with a topic classi- 
fier, or universe model. We borrow the term universe from improvised theatre 
where it is used to describe the world-as-we-know-it [147], [221], [269]. The 
universe encompasses associations surrounding the dramatic world, and is mo- 
tivated by the possible world semantics theory [162]. 

We identify two modes of operation for our shaping method: revealing and 
concealing. Revealing dialogue adds additional information about the current 
universe. Generating utterances which progress a scene with new information 
is the primary goal of our approach. Concealing dialogue avoids exposing new 
information about the universe. The ability to generate both revealing and 
concealing dialogue is a convenient side-effect of this method. 

The universe model characterizes the information revealed by each utter- 
ance in a sequence. We refer to this information profile across utterances as the 
narrative arc. By tuning the revealingness we can selectively choose utterances 
to shape the narrative arc to produce more interesting and engaging dialogue. 
We argue that a balance between revealing and concealing is required for in- 
teresting and engaging collaborative dialogue [320]. Both over-specification 
and ambiguity are undesirable [147], [288]. We hypothesize that there is an 
ideal region of information revelation which our method can expose in existing
text-based narratives such as movie scripts.




5.2 Shaping the Narrative Arc 


In this section, we present a mechanism for shaping the narrative arc inspired 
by combining methods exploring entropy in textual documents [303] with the 
Simple Shapes of Stories described by Vonnegut.2 We describe concepts of 
conversation and universe models. Then, we show how these combine to
describe a narrative arc. Finally, we show how the narrative arc can be used to
generate interesting dialogue.


5.2.1 The Conversation Model 


A conversation model accepts an input utterance and generates one, or several, 
output utterance(s). The conversation model maintains local coherence by 
conditioning output generation on the input. We write $\mathcal{X}$ to denote the set of
possible utterances (i.e. sequences of words); in this work, $\mathcal{X}$ is a collection of
English sentences. A sequence of $t$ successive utterances is a dialogue, denoted
$x_{1:t}$. A conversation model yields a probability $q$ of an utterance $x_t$ given a
dialogue $x_{1:t-1}$.

We focus on dialogue generation using three retrieval-based conversation 
models. The first two models are based on the OpenSubtitles dataset [186]. 
Pre-processing details are included in the supplementary material. When 
queried with an input line $x_{t-1}$, a model returns K candidate responses:

• Baseline Random model: sample K lines from $\mathcal{X}$.

• Deep neural network model (DNN): we embed all the lines in $\mathcal{X}$
into a latent semantic space S using the Universal Sentence Encoder [46].
We encode the input line into S, and return the K approximate nearest
neighbours [134] in S using the L2 norm as the distance metric.3

Similar to the DNN model, a third model (Books), responds with semantically
related nearest neighbour lines from literature, filtered for offensive
content.4
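
The retrieval step described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the bag-of-words `embed` function is a toy stand-in for the Universal Sentence Encoder, the three-line corpus is invented, and exact search replaces the approximate nearest-neighbour search used in the thesis.

```python
import math


def tokenize(line):
    """Lower-case a line and strip surrounding punctuation from each word."""
    words = (w.strip(".,!?;:'\"") for w in line.lower().split())
    return [w for w in words if w]


def build_vocab(corpus):
    """Map every word in the corpus to a vector index."""
    vocab = {}
    for line in corpus:
        for word in tokenize(line):
            vocab.setdefault(word, len(vocab))
    return vocab


def embed(line, vocab):
    """Toy embedding (assumption): unit-normalised bag-of-words counts."""
    vec = [0.0] * len(vocab)
    for word in tokenize(line):
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query, corpus, k):
    """Return the k corpus lines nearest to the query under the L2 norm."""
    vocab = build_vocab(corpus)
    q = embed(query, vocab)
    return sorted(corpus, key=lambda line: l2_distance(q, embed(line, vocab)))[:k]


# Invented corpus standing in for the subtitle lines.
corpus = [
    "good evening, cousin john.",
    "the ship sails at dawn.",
    "where is the pirate treasure?",
]
candidates = retrieve("Good morrow, cousin.", corpus, k=1)
print(candidates)
```

A learned sentence encoder would capture semantic similarity beyond shared words; the sketch only shows the embed, rank, and return-top-K structure.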





2From K. Vonnegut lecture: https://goo.gl/JuEDVR
3http://mathworld.wolfram.com/L2-Norm.html 
4https://books.google.com/talktobooks/

5.2.2 The Universe Model


The universe model measures how each successive utterance of a dialogue in- 
fluences the probability distribution over universes. For a given utterance, the 
universe model calculates a probability distribution over universes. For a se- 
quence of utterances, we use recursive universe belief propagation (Sec. 5.2.3) 
to update the posterior over the course of a dialogue. Revealing dialogue 
would concentrate probability mass on a single universe, and concealing dia- 
logue would maintain posterior likelihood over a set of universes. The shape 
of this sequence of posteriors is the narrative arc (Sec. 5.2.4). We investi- 
gated reveal/conceal dynamics using three different universe models based on 
probabilistic topic classifiers. 

• Newsgroups: Using the newsgroup classification dataset, we filter out
stop-words, create frequency vectors, and use the TF-IDF (term
frequency / inverse document frequency) [286] word weighting scheme to
account for word importance in the corpus. We train a naive Bayes
classifier on 5 aggregate topic universes (COMPUTERS, RECREATION,
RELIGION, SCIENCE, and TALK) [144].

• Movies: naive Bayes classifier, trained similarly to Newsgroups, using a
collected dataset of film synopses and one of 10 corresponding genres
(DRAMA, COMEDY, HORROR, ACTION, CRIME, ROMANTIC COMEDY,
ROMANCE, THRILLER, FILM ADAPTATION and SILENT FILM) from
Wikipedia data [121].

• DeepMoji: Deep neural network that takes input text and outputs a
distribution over a set of 8 aggregated emoji universes: (SAD, MAD,
MEH, NERVOUS, GLAD, MUSIC, LOVE, and MISCELLANEOUS) [83]. Input
text is not transformed, and a pretrained model is used.5
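
To make the idea of a probabilistic topic classifier concrete, the following is a minimal sketch of a multinomial naive Bayes universe model with a uniform prior. It is an illustration under stated assumptions, not the thesis implementation: it uses raw word counts with Laplace smoothing rather than TF-IDF weights, and the stop-word list, training sentences, and two universes are invented.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}  # illustrative subset


def tokenize(text):
    """Lower-case, strip punctuation, and drop stop-words."""
    words = (w.strip(".,!?;:'\"") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def train(docs_by_universe):
    """Count word occurrences per universe and collect the vocabulary."""
    counts = {u: Counter() for u in docs_by_universe}
    for universe, docs in docs_by_universe.items():
        for doc in docs:
            counts[universe].update(tokenize(doc))
    vocab = set().union(*counts.values())
    return counts, vocab


def classify(utterance, counts, vocab):
    """Return z(u | x): a distribution over universes under a uniform prior."""
    log_scores = {}
    for universe, c in counts.items():
        total = sum(c.values())
        # Laplace smoothing keeps unseen words from zeroing the likelihood.
        log_scores[universe] = sum(
            math.log((c[w] + 1) / (total + len(vocab)))
            for w in tokenize(utterance) if w in vocab
        )
    top = max(log_scores.values())
    unnorm = {u: math.exp(s - top) for u, s in log_scores.items()}
    z = sum(unnorm.values())
    return {u: v / z for u, v in unnorm.items()}


# Invented training data for two toy universes.
docs = {
    "SCIENCE": ["the physicist measured relativity",
                "an experiment in the laboratory"],
    "RECREATION": ["the team won the hockey game", "a long bike ride"],
}
counts, vocab = train(docs)
belief = classify("Einstein extended relativity", counts, vocab)
print(belief)
```

Words outside the training vocabulary are ignored, so an utterance with no known words yields a uniform distribution over universes.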


5.2.3 Recursive Universe Belief Propagation 


We desire a means by which we can update the universe belief incrementally
as evidence is accumulated with each successive utterance in a dialogue. We
begin by defining the notion of a universe model as a means of modelling
the dynamics of information revelation. Consider a finite set of universes, $\mathcal{U}$.
The role of a universe model is to assess the compatibility of an utterance
with a given discrete universe, $u \in \mathcal{U}$. Given such a model, we develop a
method to update the agent’s posterior universe distribution over a sequence
of utterances. For each universe $u$, the universe model assigns a likelihood
$p(x_t \mid x_{1:t-1}, u)$ to an utterance $x_t$, conditioned on a dialogue $x_{1:t-1}$.

5 github.com/bfelbo/DeepMoji

[Figure 5.1, panel (a), utterances of the synthetic dialogue:]
1: My favorite scientist and academic is Albert Einstein.
2: His theoretical experimentations are still unparalleled!
3: He obtained his doctorate from this University, in 1905!
4: Sometimes, I imagine academic debates between Einstein and Isaac Newton...
5: "The only component I missed was relativity!"
6: "And MY special relativity extended your mechanics."

[Figure 5.1, panels (b) Newsgroups and (c) Movies: probability (y-axis) against
utterance index 0 to 6 (x-axis); labelled curves are Science and Talk in (b),
Drama and Comedy in (c).]

Figure 5.1: The narrative arcs of a synthetic dialogue (a), using the Newsgroups
universe model (b) and Movies universe model (c). This dialogue is
likely SCIENCE or TALK under the Newsgroups model, and DRAMA or
COMEDY under the Movie genres model.

The universe model iteratively updates a posterior belief over universes,
in a similar spirit to prediction with expert forecasters [48]. The probability
of a given universe depends on iteratively combining evidence in support of
that universe. We define the posterior probability over universes $\mathcal{U}$ given
a sequence of $t$ utterances $x_{1:t}$ as:

$$p_t(u \mid x_{1:t}) = p_{t-1}(u \mid x_{1:t-1}) \times \frac{p(x_t \mid x_{1:t-1}, u)}{p(x_t \mid x_{1:t-1})}$$

where $p_{t-1}(u \mid x_{1:t-1})$ is the prior probability, $p(x_t \mid x_{1:t-1}, u)$ is the likelihood
of the utterance conditioned on the past dialogue and universe, and
$p(x_t \mid x_{1:t-1})$ is the likelihood of the utterance under the conversation model.

Let $p_0(u \mid \cdot) = 1/|\mathcal{U}|, u \in \mathcal{U}$ be an initially uniform distribution over
universes, or universe model’s prior. We can marginalize out the universe if the
evidence is consistent over all hypotheses. To illustrate the relationship
between utterance likelihood and universe, we can explicitly write the marginal
likelihood as:

$$p(x_t \mid x_{1:t-1}) = \sum_{u \in \mathcal{U}} p_{t-1}(u \mid x_{1:t-1})\, p(x_t \mid x_{1:t-1}, u)$$

Thus, the posterior is updated recursively as:

$$p_t(u \mid x_{1:t}) = p_{t-1}(u \mid x_{1:t-1}) \frac{p(x_t \mid x_{1:t-1}, u)}{\sum_{u'} p_{t-1}(u' \mid x_{1:t-1})\, p(x_t \mid x_{1:t-1}, u')} \tag{5.1}$$


In practice, it may be convenient to use the output $z(u \mid x_t)$ of a probabilistic
classifier in lieu of a likelihood function conditioned on past utterances $x_{1:t-1}$ and
universe $u$. Universe classifiers can be trained separately from language
models, and provide complementary signal if model input distributions overlap.
This assumption is justified when both models work with similar training
corpus vocabularies. We view the probability distribution over universes output
by the universe model as derived from a joint distribution $z(u, x_t)$ of the
universe $u$ and utterance $x_t$. With $z(u)$ as the prior distribution over universes,
the conditional probability is:

$$z(u \mid x_t) = z(u) \times \frac{z(x_t \mid u)}{z(x_t)}$$


We can substitute $z(\cdot \mid x_t)$ for $p(x_t \mid x_{1:t-1}, \cdot)$ in Eq. 5.1 by assuming
conditional independence (i.e., $p(x_t \mid x_{1:t-1}, u) = p(x_t \mid u)$), uniform prior
distribution (i.e., $z(u) = 1/|\mathcal{U}|, u \in \mathcal{U}$) and constant marginal probability (i.e.,
$z(x_t) = \sum_{u'} p_t(u')\, p(x_t \mid u')$). These assumptions are justified when the
probabilistic topic classifier is a naïve Bayes classifier with uniform prior [22]. Thus,
the substitution proceeds as follows:

$$\begin{aligned}
p(x_t \mid x_{1:t-1}, u) &\approx z(x_t \mid u) && \text{[cond. independence]} \\
&= \frac{z(u \mid x_t)\, z(x_t)}{z(u)} && \text{[Bayes' theorem]} \\
&\propto z(u \mid x_t)\, z(x_t) && \text{[$z(u)$ uniform prior]} \\
&\propto z(u \mid x_t) && \text{[$z(x_t)$ const. marginal]}
\end{aligned}$$

Eq. 5.1 thus becomes:

$$p_t(u \mid x_{1:t}) = p_{t-1}(u \mid x_{1:t-1}) \times \frac{z(u \mid x_t)}{\sum_{u'} p_{t-1}(u' \mid x_{1:t-1})\, z(u' \mid x_t)} \tag{5.2}$$


5.2.4 The Narrative Arc 


As defined in Eq. 5.2, the posterior $p_t(\cdot)$ is a function of the dialogue $x_{1:t}$. We
define the narrative arc as the sequence of universe distributions $p_0(\cdot), p_1(\cdot), \ldots$
iteratively calculated for the dialogue. The arc depicts the evolution of a belief
over a set of universes. The narrative arc function maps $\mathcal{X}^t \to S(\mathcal{U})^t$, where
$S(\mathcal{U})$ is a probability simplex over $\mathcal{U}$. We discuss three properties of the
narrative arc of the synthetic dialogue shown in Fig. 5.1:

1. Utterances affect the arc in varying degrees. “My favourite
scientist and academic is Albert Einstein” is similarly likely under SCIENCE
and TALK, and less likely under the RECREATION universe (bottom green
line). Different utterances should have different effects on $p_t(\cdot)$.

2. A concentrating posterior signals a revealing dialogue.
A dialogue which emphasizes scientific content, for example, should see
$p_t(\text{SCIENCE} \mid \cdot) \to 1$. Conversely, we would expect a concealing dialogue to
spread the posterior across multiple universes.

3. A universe model is a perspective on dialogue. Different universe
models can expose different aspects of the same dialogue. Replacing
the Newsgroups universe model by a Movies universe model suggests the
dialogue is from a DRAMA and/or COMEDY universe. This dialogue would be
considered revealing under both universe models.
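
The recursive update of Eq. 5.2 and the resulting narrative arc can be sketched in a few lines. This is a minimal illustration, assuming classifier outputs are given as dictionaries over universes; the function names and the numeric outputs below are invented for the example.

```python
def update_belief(prior, z):
    """One step of Eq. 5.2: reweight the prior by z(u | x_t), then normalise."""
    unnorm = {u: prior[u] * z[u] for u in prior}
    total = sum(unnorm.values())
    return {u: v / total for u, v in unnorm.items()}


def narrative_arc(classifier_outputs, universes):
    """Return [p_0, p_1, ..., p_t], starting from a uniform prior p_0."""
    belief = {u: 1.0 / len(universes) for u in universes}
    arc = [belief]
    for z in classifier_outputs:
        belief = update_belief(belief, z)
        arc.append(belief)
    return arc


universes = ["SCIENCE", "TALK", "RECREATION"]
# Hypothetical classifier outputs z(u | x_t) for three successive utterances.
outputs = [
    {"SCIENCE": 0.5, "TALK": 0.4, "RECREATION": 0.1},
    {"SCIENCE": 0.6, "TALK": 0.3, "RECREATION": 0.1},
    {"SCIENCE": 0.7, "TALK": 0.2, "RECREATION": 0.1},
]
arc = narrative_arc(outputs, universes)
for step, belief in enumerate(arc):
    print(step, {u: round(p, 3) for u, p in belief.items()})
```

In this toy run each utterance favours SCIENCE, so the posterior concentrates on that universe, which is the signature of a revealing dialogue.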

In this section, the universe model was applied to a fixed dialogue $x_{1:t}$, but
the model also provides a criterion for favouring utterances when generating
dialogue.

[Figure 5.2: three narrative arc panels plotting probability against utterance:
Newsgroups (labelled curve: Talk), Movies (labelled curve: Romance), and
DeepMoji (labelled curves: Sad, Love). Utterances shown alongside the panels:]

1: Good morrow, cousin.
2: Is the day so young?
3: But new struck nine.
4: Ay me! sad hours seem long. Was that my father that went hence so fast?
5: It was. What sadness lengthens Romeo's hours?
6: Not having that which having makes them short.
7: In love?
8: Out-
9: Of love?
10: Out of her favour where I am in love.
11: Alas that love, so gentle in his view, should be so tyrannous and rough in proof!
12: Alas that love, whose view is muffled still, should without eyes see pathways to his will! Where shall we dine? O me! ...
13: No, coz, I rather weep.
14: Good heart, at what?
15: At thy good heart's oppression.
16: Why, such is love's transgression. Griefs of mine own lie heavy in my breast ..
17: Soft! I will go along. An if you leave me so, you do me wrong.
18: Tut! I have lost myself; I am not here: This is not Romeo, he's some other where.
19: Tell me in sadness, who is that you love?
20: What, shall I groan and tell thee?

Figure 5.2: First 20 lines of Romeo and Juliet modelled with Newsgroups
(top), Movies (middle), and DeepMoji (bottom) universe models.


5.2.5 Generating Dialogue with the Narrative Arc 


The entropy of the posterior $p_t(\cdot)$ is given by:

$$H(p_t(\cdot)) = -\sum_{u \in \mathcal{U}} p_t(u \mid x_{1:t}) \log p_t(u \mid x_{1:t})$$

Then, the entropy change $\Delta(\cdot)$ due to a new utterance, $x_t$, given the past
dialogue, $x_{1:t-1}$, is defined as:

$$\Delta(x_t; x_{1:t-1}) = H(p_{t-1}(\cdot)) - H(p_t(\cdot))$$

The term $\Delta(x_t; x_{1:t-1})$ measures how much a given utterance $x_t$ changes the
entropy of the posterior, given the previous utterances $x_{1:t-1}$. A positive value
of $\Delta(\cdot)$ is a reduction in entropy (i.e. revealing). Conversely, a negative value
of $\Delta(\cdot)$ is an increase in entropy (i.e. concealing). We define the score of an
utterance $x_t$, with respect to a dialogue, $x_{1:t-1}$, as:

$$\sigma(x_t; x_{1:t-1}) = \exp\{\alpha \Delta(x_t; x_{1:t-1})\}, \quad \alpha \in \mathbb{R}$$

The exponential function is a convenient way to ensure strict positivity and
preserve the ordering of scored candidates. We use our entropy-based score
function $\sigma$ to modulate the sampling of a base conversation model, $q$, toward
$\tilde{q}$, which depends on the change in entropy due to the new utterance.

$$\tilde{q}(x_t \mid x_{1:t-1}) \propto q(x_t \mid x_{1:t-1}) \times \sigma(x_t; x_{1:t-1}) \tag{5.3}$$

If $\alpha = 0$, $\sigma(\cdot) = 1$ and candidates are sampled according to $\tilde{q} = q$. If $\alpha \neq 0$, $q$
is modulated by the score $\sigma(\cdot)$. Modulation mode depends on the value of $\alpha$:

• $\alpha > 0$ (reveal): modulate $q$ towards revealing the universe. The
probability of utterances likely under the universe with highest probability
is increased.

• $\alpha < 0$ (conceal): modulate $q$ towards concealing the universe. The
probability of utterances likely under multiple unlikely universes is
increased. Utterances not supporting the likely universe are made more
likely.
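
The entropy change, the exponential score, and the modulation of Eq. 5.3 can be illustrated with a small worked sketch. The function names, the two toy universes, and the probabilities are assumptions chosen for the example; natural logarithms are used, which is also an assumption since the base is not specified here.

```python
import math


def entropy(belief):
    """Shannon entropy (in nats) of a distribution over universes."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0)


def delta(prev_belief, new_belief):
    """Entropy change: positive when the new utterance is revealing."""
    return entropy(prev_belief) - entropy(new_belief)


def score(prev_belief, new_belief, alpha):
    """Utterance score sigma = exp(alpha * Delta)."""
    return math.exp(alpha * delta(prev_belief, new_belief))


def modulate(q, scores):
    """Eq. 5.3: multiply base probabilities by scores, then normalise."""
    unnorm = [qi * si for qi, si in zip(q, scores)]
    total = sum(unnorm)
    return [v / total for v in unnorm]


# Toy posterior before the new utterance, and after each of two candidates.
prev = {"DRAMA": 0.5, "COMEDY": 0.5}
after_revealing = {"DRAMA": 0.9, "COMEDY": 0.1}  # concentrates the belief
after_neutral = {"DRAMA": 0.5, "COMEDY": 0.5}    # leaves it unchanged

base_q = [0.5, 0.5]  # base model probability of each candidate
for alpha in (2.0, 0.0, -2.0):
    scores = [score(prev, after_revealing, alpha),
              score(prev, after_neutral, alpha)]
    print(alpha, [round(p, 3) for p in modulate(base_q, scores)])
```

With a positive alpha the revealing candidate gains probability, with a negative alpha it loses probability, and with alpha equal to zero the base distribution is returned unchanged.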

We use these two modulations for filtering samples from our base conversation
model. We filter via one of two methods for sampling from an unnormalized
distribution: greedy sampling and rejection sampling. Greedy
sampling scores a set of samples from the conversation model and selects the
candidate with the maximum score. Scoring a large set of candidates can
be time intensive. Rejection sampling (Alg. 2) can sample from the desired
unknown modulated distribution online [237]. Additional details on rejection
sampling are included in the supplementary material. As the entropy function
is bounded, the utterance score $\sigma$ is bounded. In practice, we set a max score
and weigh all utterance scores $\sigma$ above the threshold equally. Both filtering
methods have benefits. Rejection sampling provides a smoother distribution
and does not require scoring a large set of candidates. Greedy sampling is less
sensitive to the range of $\Delta$ from different utterances.
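
Greedy sampling as described above can be sketched as follows. This is an illustration under assumptions: `classify` is any function mapping an utterance to a distribution over universes (here a hypothetical lookup table with invented probabilities), and the optional `max_score` cap mirrors the thresholding mentioned in the text.

```python
import math


def entropy(belief):
    """Shannon entropy (in nats) of a distribution over universes."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0)


def update_belief(prior, z):
    """Reweight the prior by the classifier output and normalise."""
    unnorm = {u: prior[u] * z[u] for u in prior}
    total = sum(unnorm.values())
    return {u: v / total for u, v in unnorm.items()}


def greedy_select(candidates, belief, classify, alpha, max_score=math.inf):
    """Score every candidate and return the one with the maximum score."""
    best, best_score = None, -math.inf
    for utterance in candidates:
        new_belief = update_belief(belief, classify(utterance))
        change = entropy(belief) - entropy(new_belief)
        s = min(math.exp(alpha * change), max_score)  # cap at the max score
        if s > best_score:
            best, best_score = utterance, s
    return best


# Hypothetical classifier outputs for two candidate lines.
TOY_OUTPUTS = {
    "your cousin is murdered.": {"DRAMA": 0.9, "COMEDY": 0.1},
    "what about frasier?": {"DRAMA": 0.5, "COMEDY": 0.5},
}
belief = {"DRAMA": 0.5, "COMEDY": 0.5}
candidates = list(TOY_OUTPUTS)
revealing = greedy_select(candidates, belief, TOY_OUTPUTS.get, alpha=1.0)
concealing = greedy_select(candidates, belief, TOY_OUTPUTS.get, alpha=-1.0)
print(revealing, "|", concealing)
```

Because only the ranking of scores matters for the selection, the choice does not depend on how large the entropy changes are, which is consistent with greedy sampling being less sensitive to their range.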


5.3 Evaluation 
5.3.1 Narrative Arc of Existing Dialogues 


In Fig. 5.2, we visualize the narrative arc underlying the first 20 lines of
Shakespeare’s Romeo and Juliet using three universe models: 1) Newsgroups,
2) Movies, and 3) DeepMoji.

Algorithm 2 Generating dialogue with rejection sampling.
Given: conversation model $q$, scoring function $\sigma$, first line $x_1$, length $N$,
max score $M$, max samples $S$
Return: dialogue $x_{1:N}$
for $t$ in $2 \ldots N$ do
    step ← 0
    while step < $S$ do
        sample $x_t \sim q(x_t \mid x_{1:t-1})$
        sample $r \sim \text{Uniform}(0, 1)$
        if $r < \sigma(x_t; x_{1:t-1}) / M$ then
            append $x_t$ to $x_{1:t-1}$
            break
        step ← step + 1
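
A runnable sketch of the rejection sampling loop in Algorithm 2 is given below. Several details are assumptions rather than part of the algorithm as stated: `sample_fn` and `score_fn` are hypothetical stand-ins for the conversation model and the score function, scores are capped at the max score so the acceptance ratio stays in [0, 1], and when every sample is rejected the last candidate is kept so that the dialogue always reaches the requested length.

```python
import random


def rejection_sample_dialogue(sample_fn, score_fn, first_line, length,
                              max_score, max_samples, rng=None):
    """Generate a dialogue of `length` lines by rejection sampling."""
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    rng = rng or random.Random()
    dialogue = [first_line]
    for _ in range(2, length + 1):
        for _step in range(max_samples):
            candidate = sample_fn(dialogue, rng)
            # Cap the score so the acceptance ratio stays within [0, 1].
            ratio = min(score_fn(candidate, dialogue), max_score) / max_score
            if rng.random() < ratio:
                break
        # Assumption: if all samples were rejected, keep the last candidate.
        dialogue.append(candidate)
    return dialogue


# Invented candidate pool and toy score favouring lines about cousins.
LINES = ["cousin john?", "your cousin billy.", "is that lexi?"]


def sample_fn(dialogue, rng):
    return rng.choice(LINES)


def score_fn(candidate, dialogue):
    return 2.0 if "cousin" in candidate else 0.5


dialogue = rejection_sample_dialogue(
    sample_fn, score_fn, "Good morrow, cousin.",
    length=5, max_score=2.0, max_samples=10, rng=random.Random(0),
)
print(dialogue)
```

Lines with a score at or above the max score are always accepted, while lower-scoring lines are accepted in proportion to their score, so no candidate set needs to be scored in advance.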





Fig. 5.2 illustrates the entropy-reducing nature of good dialogue by showing 
us the underlying, evolving, narrative arc. Under the Newsgroups universe 
model, the dialogue evolves toward a TALK-centric universe. Under the Movies 
model, the same dialogue balances between comedy and drama before shifting 
towards drama. Finally, using the DeepMoji universe model, a developing 
ambiguity between DeepMoji universes SADNESS and LOVE is uncovered. This 
supports the hypothesis that existing dialogues exhibit underlying narrative 
arcs conditioned on universe models. Additional samples exposing narrative
arc dynamics are presented in Appendix F.


5.3.2 Shaping the Narrative Arc 


In this section, we demonstrate that our method is able to modulate conversa- 
tion models toward generation of revealing or concealing dialogues. Linguistic 
quality and semantic consistency of utterances are determined by the language 
underlying the conversation model. Here, we emphasize evaluation of narrative 
arc shaping. 

We use the DNN conversation model to test how preferential selection, 
induced by our score function, can modulate information introduced into the 
conversation. In Fig. 5.3 we present characteristic narrative arcs and dialogues 
using concealing (top), neutral (middle), and revealing (bottom) modes. Each 
generation was primed with the first two lines from Romeo and Juliet (shown 
in bold in Fig. 5.3). 

A significant difference is exposed between concealing (top), which tends
toward a high entropy, uniform universe distribution, and revealing (bottom),
where DRAMA tends toward 1.0. DRAMA remains the most likely universe (and
visible on all plots) as it was supported by the first two lines and subsequent
utterances did not significantly shift the distribution. Fig. 5.3 also shows the
dialogue generated by the model. Concealing utterances do not add information
to the dialogue; revealing utterances incorporate new information over
the course of the dialogue.

[Figure 5.3: three narrative arc panels plotting probability against utterance,
with the utterances generated in each mode listed alongside.]

Narrative arc, alpha: negative, concealing. Concealing utterances:
1: Good morrow, cousin.
2: Is the day so young?
3: frasier's brother niles?
4: frasier cranium.
5: i'm frasier and you're frasier's dad.
6: frasier, frasier, please.
7: what about frasier?
8: uh, well, what about frasier?
9: how about frasier?
10: frasier, frasier... forget it.

Narrative arc, alpha: zero, neutral. Neutral utterances:
1: Good morrow, cousin.
2: Is the day so young?
3: cousin john?
4: hubert de montmirail is your cousin?
5: is that maddie's french cousin?
6: is that aunt maggie?
7: is that phoebe kendall?
8: is that alyssa vaughn?
9: is that lexi?
10: is that kristina?

Narrative arc, alpha: positive, revealing. Revealing utterances:
1: Good morrow, cousin.
2: Is the day so young?
3: good evening, cousin john.
4: cousin john?
5: i'm your cousin, carmen.
6: your cousin billy.
7: i heard he killed your cousin.
8: your cousin is murdered.
9: your cousin is like you.
10: she your cousin?

Figure 5.3: Narrative arcs over 10 utterances at increasing $\alpha$ values: concealing
(top), neutral (mid), revealing (bottom). On the right are utterances generated
by each model after priming (bold). Dotted red line indicates the start of
narrative arc shaping.

We next evaluate our method’s ability to generate concealing/revealing 
dialogue by measuring the entropy under both an objective universe (i.e. the 
universe model used for scoring in generation) and a test universe not used for 
scoring. We use the Newsgroups universe model for objective scoring and the 
Movies model for testing. A random conversation model is used to generate 
response candidates. 


We generate 20 conversations following a process similar to Algorithm 2
but using greedy sampling. Each conversation starts with a random dialogue
starter line to encourage diversity and then 19 lines are sampled from the 
conversation model using the narrative arc function. This approximates the 
length of a medium-duration improvised conversation [288]. 
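
The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the universe model is a toy per-universe word likelihood table, the candidate list stands in for the conversation model, and the scoring rule (α multiplied by the entropy reduction of the universe posterior) is an assumed stand-in for the narrative arc function defined earlier in the chapter. All function and variable names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def update_posterior(prior, line, likelihoods, eps=1e-3):
    """Bayesian update of the belief over universes after observing one line.

    likelihoods[u][word] is a toy word probability under universe u
    (hypothetical); words missing from the table get probability eps.
    """
    post = []
    for u, p_u in enumerate(prior):
        lik = 1.0
        for word in line.lower().split():
            lik *= likelihoods[u].get(word, eps)
        post.append(p_u * lik)
    z = sum(post)
    return [x / z for x in post]

def greedy_select(prior, candidates, likelihoods, alpha):
    """Greedily pick the candidate with the highest assumed arc score.

    Assumed score: alpha * (entropy before - entropy after). A positive
    alpha favours lines that sharpen the belief (revealing); a negative
    alpha favours lines that leave it flat (concealing).
    """
    h_prev = entropy(prior)

    def score(candidate):
        h_new = entropy(update_posterior(prior, candidate, likelihoods))
        return alpha * (h_prev - h_new)

    return max(candidates, key=score)

# Toy data: "cousin" is equally likely in both universes, "murdered" is not.
LIKELIHOODS = [
    {"cousin": 0.2, "murdered": 0.2},   # universe 0
    {"cousin": 0.2, "murdered": 0.01},  # universe 1
]
PRIOR = [0.5, 0.5]
CANDIDATES = ["cousin", "murdered"]

revealing = greedy_select(PRIOR, CANDIDATES, LIKELIHOODS, alpha=1.0)
concealing = greedy_select(PRIOR, CANDIDATES, LIKELIHOODS, alpha=-1.0)
```

In this toy setting the revealing choice is the line that discriminates between universes, and the concealing choice is the line that leaves the posterior uniform.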

Results are presented in Fig. 5.4. There is a significant difference between 
the entropy under the objective and testing universes, but each model exhibits 
similar dynamics over the dialogues. We conclude that concealing dialogue can 
conceal under multiple universes, and revealing dialogue can reveal information 
under multiple universe models. 

The revealing/concealing dynamics of each utterance may be related to 
measurable lexicographical qualities such as words per sentence (WPS). We 
analysed the language used in 190 lines from each model and found a significant 
difference (p < 0.001) between utterances selected by the revealing model 
(9.26 ± 5.7 WPS) and utterances selected by the concealing model (5.05 ± 2.79
WPS).
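
As a rough illustration of the WPS statistic, the sketch below counts whitespace-separated words per utterance and reports the mean and standard deviation. It assumes each utterance is treated as one sentence and uses the sample standard deviation; the text does not state either choice, and the function names are hypothetical.

```python
import statistics

def words_per_line(lines):
    """Number of whitespace-separated words in each utterance."""
    return [len(line.split()) for line in lines]

def summarize_wps(lines):
    """Mean and sample standard deviation of words per utterance."""
    counts = words_per_line(lines)
    return statistics.mean(counts), statistics.stdev(counts)

# Two utterances taken from Fig. 5.3, used only as toy input.
mean_wps, sd_wps = summarize_wps(["your cousin is murdered.", "cousin john?"])
```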





[Figure 5.4: two panels, (a) reveal and (b) conceal, each plotted against
Utterance; legend: obj: newsgroups, test: movies.]

Figure 5.4: Revealing and Concealing across Universe Models. Dialogue
generated to be (a) revealing (α = 20) under the objective model Newsgroups
is also revealing under the testing Movies universe. The same is true for (b)
concealing (α = −25) dialogue. Data shown are means and standard deviation
(shaded) over 20 runs of the random conversation model.


5.3.3 Predicting the Next Best Line 


We next test the system’s ability to add information to improve performance
on a prediction task. Given a sequence of 5 gold-standard conversational
utterances and a list of 10 next utterance candidates (i.e. the ground truth
and 9 distractors), can the universe model be used to improve accuracy of
predicting the ground truth?

Evaluation compares top-3 accuracy and mean reciprocal rank (MRR) 
over samples in a held out test set. Accuracy measures the likelihood that 
the system scores the ground truth within the top-3 candidates against the 
distractors. MRR compares average ground truth ranking across conditions. 
A text2text Transformer language model was trained on the OpenSubtitles 
dataset [186] to predict an output line given a set of input lines [335]. Addi- 
tional details are included in Appendix F. 
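
The two evaluation metrics can be sketched as follows. This is a minimal illustration, assuming each sample provides one score per candidate (higher is better) and that ties are broken in favour of the ground truth; the thesis does not specify tie handling, and the function names are hypothetical.

```python
def rank_of_truth(scores, truth_index):
    """1-based rank of the ground truth among the candidate scores.

    Higher scores rank first; ties are resolved in favour of the ground
    truth (an assumption made for this sketch).
    """
    truth = scores[truth_index]
    return 1 + sum(1 for s in scores if s > truth)

def top_k_accuracy(ranks, k=3):
    """Fraction of samples whose ground truth is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1 / rank of the ground truth over all samples."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three toy samples: (candidate scores, index of the ground truth).
samples = [
    ([0.1, 0.9, 0.3], 1),
    ([0.5, 0.4, 0.3, 0.2, 0.1], 3),
    ([0.2, 0.8, 0.5], 2),
]
ranks = [rank_of_truth(scores, idx) for scores, idx in samples]
top3 = top_k_accuracy(ranks, k=3)
mrr = mean_reciprocal_rank(ranks)
```

When the candidates are scored by perplexity, as in this section, the scores would be negated first, since lower perplexity is better.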

The trained Transformer model was used to assign a perplexity score to
output line candidates given the input context. For each unique subtitle
file in the validation and test sets, the concatenation of the first 5 lines
serves as input context and line 6 is the ground truth output to be predicted.
Negative candidates are randomly selected from lines in the corresponding
data segment (i.e. validation or test set), and thus may not be from the same
source file as the input context lines.

The perplexity under the trained conversation model serves as the unmodulated
probability $q(x_t \mid x_{1:t-1})$ (Eq. 5.3) of selection in the prediction task.
The input sequence is then passed, line-by-line, through a Newsgroups uni- 
verse model and a score is assigned to each candidate relative to the change in 
entropy of the evolving posterior. The α value is modulated over 100 evenly
spaced values in [−2, 2]. The accuracy of predicting the ground truth in
the top-3 candidates and the MRR of the ground truth are computed. 

The results on the validation set are shown in Fig. 5.5. By selecting the
correct α value, the likelihood of correctly selecting utterances revealing an
incremental amount of information increases significantly. Note the shape of
the curve as α changes. As hypothesized, there exists a region, between 0 and
1, where the ‘right’ amount of universe information is revealed. This region
corresponds to the notion that each line of dialogue will reveal some, but not
too much, information about the universe. As α continues to increase, the
accuracy decreases below the neutral baseline. The top-3 accuracy of prediction
increases when the universe model boosts the probabilities of appropriately
revealing dialogue. The validation set is used to set the optimal α, which is
then used to score samples in the test set, and results are presented in
Table 5.1. Two additional models are included for comparison. T2T@1 uses the
1 line preceding the ground truth as context. Unigram assigns a perplexity to
output candidates by building a unigram language model using the 5 input lines
as a corpus. A smoothing factor of 1 × 10^-5 is used for out-of-vocabulary
words. Additionally, a random conversation baseline model is included. For
each model tested, information from the universe model significantly improves
the predictive accuracy on this task.

CM        UM        Top3Acc   MRR
T2T@5     NG        0.520     0.456*
T2T@5     Neutral   0.507     0.444
T2T@1     NG        0.483     0.428*
T2T@1     Neutral   0.469     0.412
Unigram   NG        0.366     0.337*
Unigram   Neutral   0.296     0.290
Random    Neutral   0.302     0.294

Table 5.1: Results for predicting the next line. CM is the conversation model,
UM is the universe model, Top3Acc is the accuracy of predicting the ground
truth in the top-3 of 10 candidates, and MRR is the mean reciprocal rank of
the ground truth. Unigram CM calculates the perplexity of each candidate
given the input lines as training corpus. T2T@N is a Tensor2Tensor Transformer
model which uses the previous N lines as an input to predict the output
and NG is the Newsgroups universe. A Neutral universe model represents no
modulation, which is equivalent to α = 0. * indicates p < 0.05 for a Student's
t-test comparing MRR to the Neutral model.
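
A minimal sketch of the Unigram baseline follows. It assumes whitespace tokenization and lower-casing, and it assumes the smoothing value is used directly as the probability of any out-of-vocabulary word; the text states the value but not exactly how it is applied. The function name is hypothetical.

```python
import math
from collections import Counter

OOV_PROB = 1e-5  # value stated in the text; its exact use is an assumption

def unigram_perplexity(context_lines, candidate):
    """Perplexity of a candidate under a unigram model of the context lines."""
    counts = Counter(w for line in context_lines for w in line.lower().split())
    total = sum(counts.values())
    words = candidate.lower().split()
    if not words:
        return float("inf")  # an empty candidate cannot be scored
    log_prob = 0.0
    for w in words:
        p = counts[w] / total if counts[w] else OOV_PROB
        log_prob += math.log(p)
    # Perplexity is the exponentiated average negative log-probability.
    return math.exp(-log_prob / len(words))

context = ["a b", "a c"]  # toy context: P(a) = 0.5, P(b) = P(c) = 0.25
seen = unigram_perplexity(context, "a a")
unseen = unigram_perplexity(context, "z")
```

Candidates that reuse words from the context receive a lower perplexity, so this baseline prefers lexical overlap with the preceding lines.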


[Figure 5.5: two panels, (left) Accuracy top-3@10 (n=9994) and (right) Mean
Reciprocal Rank of Ground Truth (n=9994).]

Figure 5.5: Information revelation region as α varies for (left) top-3 accuracy
and (right) MRR in universe model modulated prediction task.


5.3.4 Interactive Collaborative Dialogue


Finally, as a practical implementation case-study, we tested how this system
performs in collaborative dialogue through interaction with humans. Four expert
improvisational theatre performers each engaged with the system in 3 text-based
conversations. Each conversation consisted of 5 utterance-response pairs for a
total of ten utterances (i.e. the average length of a short-duration improvised
scene [288]). Subjects are native English speakers with 5+ years of professional
performance experience and are familiar with shared narrative development
and collaborative dialogue. Each interacted with revealing, concealing, and
neutral models in a randomized order unknown to them. Transcripts of
actor-system dialogues and qualitative feedback are included in Appendix F.4.

As described in Sec. 5.2.1, this experiment used the Books conversation 
model and the DeepMoji universe model. Following the interactions, each 
performer was asked the following question: “please rank the conversations 
from 1 (most engaging) to 3 (least engaging)”. Engagingness was defined to 
align with the notions of revealing and concealing in this work. An agent is 
engaging for shared scene development if it brings just enough information to 
add specificity and reduce ambiguity but not limit the conversation. 

Three of the four performers ranked the revealing model, α > 0, as the most
engaging. Those three performers ranked α = 0 as being less engaging due to
being “too random”. All subjects ranked α < 0 as being least engaging and not
bringing enough information to the scene. These results support the hypothesis
that α can effectively modulate the engagingness of collaborative dialogue
in human-machine interaction. Qualitative feedback from expert assessment
surveys is included in Appendix F.5.2.


5.4 Related Work 


Collaborative dialogue between humans and machines has been proposed as 
a grand challenge in artificial intelligence [34], [206], [211]. Previous meth- 
ods have used hard coded rules, decision trees, and event representations to 
generate novel narrative chains [203]. We used a deep neural network-based
generative language model enhanced with universe model information in the
context of improvised theatre [212].

While neural response generation systems provide a trainable end-to-end 
system for language generation, these methods are prone to providing generic, 
unspecific responses [177]. Advances have improved generated responses by
optimizing sentence encoding and decoding jointly, post-generation candidate 
re-scoring [30], [313], [340], reinforcement learning [176], hierarchical models 
for distilling extended context [300], and auxiliary training objectives, such as 
maximizing mutual information [177], and personality specificity and consis- 
tency [178], [364]. In future work, universe models and conversational models 
could be trained jointly. 

Our work is related to the controlled generation of text using disentan- 
gled latent representations [9], [129], [365]. Previous work has used a topic- 
transition generative adversarial network to enforce smoothness of transition 
of subsequent utterances [181]. These methods use neural encoder-decoders 
and generate responses given an input sequence and a desired target class for 
the response. 

Other work has aimed to improve candidates returned by retrieval-based 
conversation models [349]. These methods utilize a conversation model to find 
similar prototypes using embedding distances and refine prototypes with a 
sequence-to-sequence model [106]. We do not refine candidates from the
conversation model; rather, we sample and select using a scoring function
defined by the revealing and concealing parameter.

Similar to universe models, topic models or lexical fields have been shown 
capable of tracking general subjects of a text [25], [92]. Dynamic topic models 
characterize the evolution of topics over a set of documents over time [24]. 
Our work differs in that we generate dialogue using the evolving probabilistic 
belief during a single conversation, as opposed to tracking topical shifts over 
longer time-scales. Using a probabilistic classifier for narrative tracking has 
been explored previously [232], [274]. These works used sentiment classifiers
to track emotion and plot arcs through narratives. We extend these works by
using probabilistic universe models for collaborative dialogue generation.


5.5 Discussion and Summary


While innovations have improved the linguistic quality, semantic alignment, 
and consistency of utterances generated by neural models, generated conversa- 
tions still lack interestingness and engagingness. Our work generates engaging 
dialogue by shaping the underlying narrative arc as opposed to improving the 
training of generative language models. The methods presented are agnostic 
to both the universe and the conversational model used. Using rules from 
improvised theatre, we quantitatively define the evolution of interesting and 
engaging dialogue. 

In this work we focus on genre, emoji, and topic-based universe models. 
Other universe models to be explored involve causality of events, directions 
of relationships, or audience reaction prediction [59], [76], [158], [276], [329]. 
While this work explores the interaction between a base conversation model 
and a universe model, this method could be compatible with image or video 
generation. 

The main contribution of this work is the computational formalization 
of the narrative arc, an information-theoretic framework for collaborative dia- 
logue interaction. The framework fills a gap in previous research by connecting 
the utterance-level improvements of language models with the conversation- 
level improvements of universe tracking. This is done by sampling candidates 
from a conversational model using a universe model and the narrative arc. 
We illustrate narrative arcs underlying popular dialogues and show how uni- 
verse models can be combined with conversation models to generate interesting 
dialogue. We present empirical results showing how the narrative arc can
improve accuracy on a next line prediction task. Finally, we present an expert
user-study to validate our model.


Chapter 6


dAIrector: Contextual Story
Arc Generation


[In] artificial intelligence... machines are made to behave in wondrous ways,
often sufficient to dazzle even the most experienced observer. But once a
particular program is unmasked, once its inner workings are explained... its
magic crumbles away...


— Joseph Weizenbaum, 1966 


dAIrector is an automated director which collaborates with human storytellers
for live improvisational performances and writing assistance. dAIrector
can be used to create short narrative arcs through contextual plot generation.
In this work, we present the system architecture, a quantitative evaluation of
design choices, and a case-study usage of the system which provides qualitative
feedback from a professional improvisational performer. We present relevant
metrics for the understudied domain of human-machine creative generation,
specifically long-form narrative creation. We include, alongside publication,
open-source code so that others may test, evaluate, and run the dAIrector.





A version of this chapter was presented as Eger M and Mathewson KW. (2018)
dAIrector: Automatic Story Arc Generation through Knowledge Synthesis. Joint Intelligent
Narrative Technologies / Intelligent Cinematography and Editing Workshop at AAAI
Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE). November
2018. Edmonton, Alberta, Canada.


6.1 Introduction


Improvisational theatre (improv) is an art-form in which narratives are devel- 
oped ad-hoc in front of a live audience [147]. Performers are prompted with a 
concise, ambiguous suggestion (e.g. a location or character relationship) and 
then share narrative development through action and dialogue. Often these 
prompts are provided by the audience throughout a performance. The most 
interesting challenge of improvisation is incorporating new suggestions, seem- 
ingly unrelated to the narrative. Improvisation’s live justification has been 
proposed as a model for real-time dynamic problem solving [200], [815].
Improv has been proposed as a grand challenge for machine learning systems
[206], potentially as an extension to the Turing test [211], [330]. The
dAIrector collaborates with human improvisors for semi-automated story beat
generation, suitable for improvisation performance, through knowledge graph
synthesis. First, we describe some background on story generation,
improvisational theatre, and plot graphs (from Plotto and TV Tropes). Then, we
describe our approach and present quantitative and qualitative evaluation. We
conclude with discussion of limitations and future work.


6.2 Background and Related Work 


6.2.1 Automated Story Generation 


The research problem of automated story generation (ASG) is concerned with
generating a sequence of elements which collectively form a narrative [59],
[224]. The sequence can be composed of abstract concepts such as events or
actions, or concrete text-based elements such as paragraphs, sentences, words,
or characters. Different levels of abstraction and concreteness are accompanied
by different challenges. For instance, stories defined at high levels of
abstraction maintain step-to-step coherence more easily but are simplified and
lack unique, specific details.

Previous ASG systems have used symbolic planning and extensive
hand-engineering [277]. Open story generation systems use machine learning
techniques to learn representations of the domain from the training data and
incorporate knowledge from an external corpus [174]. Martin et al. [204] address the
abstraction level challenges by using recurrent neural networks (RNNs) and an 
event representation to provide a level of abstraction between words and sen- 
tences capable of modelling narrative over hundreds of steps. They provide a 
method of pre-processing textual data into event sequences and then evaluate 
their event-to-event and event-to-sentence models. Our methods are distinct 
from this technique as we do not focus on the problem of sentence genera- 
tion from words or characters. The dAIrector embraces human co-creators to 
provide dialogue for given plot point descriptions and context. 

Narrative generation approaches, such as TALESPIN [223], focus on actions,
their effects, and element relationships to delineate character intentions
[277] and conflict [342], which ultimately leads to satisfying an
author-defined goal. Alternatively, there may be no predefined goal, and
systems may discover actions autonomously [324], ideally resulting in an
interesting story. To produce an interesting story in the context of improv
theatre, however, the prescription of actions (e.g. lines of dialogue,
character choices, stage directions) is less desirable. A description of a
situation suffices to inspire actors who can then translate the prompt into
actions, with vague and ambiguous prompts giving the actors freedom to explore
the scene [288].


6.2.2 Digital Storytelling 


The ad-hoc storytelling experience present in improv theatre has been used 
for research into digital storytelling for more than two decades. Perlin and 
Goldberg [256] use concepts from improvisational theatre to populate virtual 
worlds, while Hayes-Roth and van Gent [112] describe virtual agents that per- 
form improvisational theatre, modifying their appearance to convey simulated 
emotional state. Several knowledge-based approaches have been proposed for 
various problems in the space of acting in the improv theatre, such as scene 
introduction [246], fuzzy reasoning [199], affect detection [363], and robotic 
actors [212]. 


Through collaboration between human and machine, complex stories can
be constructed. Generative plot systems have been developed for nearly 100
years [59]. These systems aim to aid human creators with algorithmically
generated prompts to explore diverse plots. Through interaction with generative
systems, users are inspired to engage with topics they would not have
otherwise. The excitement of exploring unknown spaces and engaging with novel
topics defines the art-form of theatrical improvisation [147].


6.2.3 Improvised Theatre 


In improvised theatre (improv) there is no script and no rehearsal; the show 
is written and performed at the same time. It is an art derived from the 
spontaneous justification of pseudo-randomness. Improvised theatre has been 
described as a suitable test bed for human-machine co-creation systems [200], 
[206], [212]. In improv, performers must attend to, and remember, details in 
the story and must synthesize previous information with novel dialogue and 
actions to progress a narrative. Often, external prompts (or suggestions) are
used to add entropy to the performance [147]. This motivates
actors to justify information within the context of the current scene [288].
Improvised scenes can be summarized in three stages: platform, tilt, and
finding a new normal [147]. The platform of the scene defines what is normal
in the universe (i.e. who, what, when, where). The tilt provides flavour.¹ A
tilt re-frames the scene with a different context. It is what makes this
particular performance unique from others with a similar platform. Finally,
finding a new normal is how the scene justifies the tilt towards resolution.
These three stages enable investigation of the ability of the dAIrector to
generate cohesive plots and develop context-rich narrative. Our work addresses
the specific aspect of generating prompts for the actors on stage during an
improvised theatre performance. These prompts constitute the beats of the
story in the form of a platform, as well as tilts for the actors. It is then
up to the actors to act out the scene to find a new normal.





¹ http://thewayofimprovisation.com/glossary.php


6.3 Approach


We present an improvised narrative co-creation system called dAIrector which
acts as an automated director to collaborate with human storytellers for live
improvisational performances and writing assistance. The generated stories
are represented as linked clauses taken from William Wallace Cook’s “Plotto:
The Master Book of All Plots” [59] augmented with related information from
TV Tropes,² a wiki-style database that contains narrative tropes occurring
in a wide range of different narratives. Human artists can rapidly link the
provided prompts to collaboratively evolve a narrative through dialogue and
actions. In this way, the dAIrector augments human creativity. We discuss the
challenges of evaluating a tool that, by design, provides ambiguous guidance.
We conclude by presenting several directions for future research.


6.3.1 Plotto 


Our work builds on the narrative development book “Plotto: The Master Book 
of All Plots” by William Wallace Cook [59], which contains a large variety of 
plots. What makes it suitable for a computational application is the graph 
structure. Rather than enumerating plots, Cook split them into fragments 
with instructions on how to combine them. 

The plot fragments constitute nodes, and edges between them describe
which fragments can be connected to obtain a story. Edges can have labels,
which contain instructions for changing character symbols in subsequent plot
fragments (e.g. changing character A to character B). Figure 6.1 shows a
sub-graph from Plotto, each node a plot point and each edge a modification.
Fragment 746 is defined in Plotto as: B, who was thought by the people of her
community to have supernatural powers, is discovered to have been insane - a
condition caused by a great sorrow. This fragment can be followed by either
fragment 1441a: A seeks to discover the secret of Life, or fragment 1373: A
sells his shadow for an inexhaustible purse. Both of these fragments make
no mention of B, who is the main character of fragment 746. Therefore, the
modifying edge instructs us to change A to B, ensuring consistency of the
characters used. These three nodes represent just a small fraction of the entire
3000 nodes contained in Plotto. We automatically parse the nodes and edges
into a JSON-based representation of the graph. Thus, generating a story is
done by performing a walk through the graph starting at any random node.

² http://tvtropes.org/

[Figure 6.1: graph diagram of plot fragment nodes; edge labels: "ch B to A"
and "ch A to B".]

Figure 6.1: A sub-graph from Plotto. Nodes are plot fragments with
corresponding numbers from Plotto, edges are connections between plot points,
and edge labels correspond to instructions for character modifications in
subsequent fragments.

Eger et al. (2015) used a similar method to build a plot generator utilizing
the plot fragments [76]. Since the plot fragments in Plotto are abstract
descriptions of plot points, ambiguous, and contain symbolic names for the
characters, the generated plots are less than suitable for presentation to an
audience. However, the ambiguity and openness make them ideal for
interpretation in improvised theatre. These plot points represent platforms
for scenes; additional related details are needed to tilt the scene.
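
The walk through the Plotto graph with character relabelling can be sketched as below. The fragment texts are those quoted above; the dictionary layout, the `relabel` and `walk` helpers, and the choice to apply only the most recent edge label are assumptions made for illustration and do not describe the actual JSON schema used by the system.

```python
import random
import re

# Toy excerpt of the graph: node id -> text and outgoing (successor, label) edges.
# A label such as {"A": "B"} means "change A to B" in the successor fragment.
GRAPH = {
    "746": {
        "text": ("B, who was thought by the people of her community to have "
                 "supernatural powers, is discovered to have been insane"),
        "next": [("1441a", {"A": "B"}), ("1373", {"A": "B"})],
    },
    "1441a": {"text": "A seeks to discover the secret of Life", "next": []},
    "1373": {"text": "A sells his shadow for an inexhaustible purse", "next": []},
}

def relabel(text, mapping):
    """Replace stand-alone character symbols in one pass, so swaps are safe."""
    if not mapping:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)

def walk(graph, start, rng, max_steps=5):
    """Random walk from start, relabelling each successor by its edge label."""
    story = [graph[start]["text"]]
    node = start
    for _ in range(max_steps):
        edges = graph[node]["next"]
        if not edges:
            break
        node, mapping = rng.choice(edges)
        story.append(relabel(graph[node]["text"], mapping))
    return story

story = walk(GRAPH, "746", random.Random(0))
```

The word-boundary match is deliberately naive: it would also rewrite the English article "A", and it leaves pronouns such as "his" untouched, so a real implementation would need a more careful treatment of character symbols.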


6.3.2 TV Tropes 


TV Tropes is a wiki-style website that contains narrative tropes, i.e. patterns 
or situations that occur across a variety of different narratives. As a wiki, 
tropes often contain references to other, related tropes. Of particular
interest for our work are TV Tropes’s plot tropes, which describe high-level
plot abstractions. While related to those plots in Plotto, this TV Tropes graph
contains unlinked, semantically related, complementary information.


The story beat is an identifiable moment of change in a narrative [209],
[220], either in the form of a new platform or as a tilt. By connecting the
Plotto graph plot points as platforms and the TV Tropes plot tropes as tilts,
the dAIrector creates complete abstract narrative descriptions.

As described above, story generation using Plotto can be thought of as a
walk through the graph of plot fragments. For a performance, dAIrector starts
at a random node in the graph by presenting it to the actors on stage. We
call the plot fragment presented to the actors by the system the platform. The
actors can prompt the system for 1) the next platform in the form of a new plot
fragment or 2) a tilt to refine the current scene, and the system will use
the platform to determine which plot fragment or tilt to present next.


6.3.3 Paragraph Vectors 


Our system utilizes paragraph vectors to provide information dependent on 
the current platform. Paragraph Vector is an unsupervised machine learning 
method to represent variable-length input text in a dense, fixed-length feature 
vector [166]. Paragraph vectors overcome the limitations of naive bag-of-words
methods, which ignore word order and semantics. To train the paragraph vector
model, we use the full text of the plot fragments, as well as the descriptions of
the tropes. We follow the training as described by Le and Mikolov (2014) and
use the Doc2Vec³ method from Gensim.⁴ Training parameters are as follows:

• Dimensionality of feature vector: 410

• Initial learning rate: 0.03

• Maximum distance between the current and predicted word within a
sentence (window): 4

• Ignore all words with fewer than two occurrences

• Negative sampling is used with 4 noise words

All other parameters are defaults as defined in Gensim.
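
The listed settings map onto Gensim's `Doc2Vec` keyword arguments roughly as sketched below. The parameter names are those of recent Gensim releases (`vector_size` was called `size` in older versions); the whitespace tokenization, the `texts` dictionary, and the `train_paragraph_vectors` helper are assumptions for illustration and not the authors' training script.

```python
# Keyword arguments for gensim.models.doc2vec.Doc2Vec matching the list above.
DOC2VEC_PARAMS = {
    "vector_size": 410,  # dimensionality of feature vector
    "alpha": 0.03,       # initial learning rate
    "window": 4,         # max distance between current and predicted word
    "min_count": 2,      # ignore words with fewer than two occurrences
    "negative": 4,       # negative sampling with 4 noise words
}

def train_paragraph_vectors(texts):
    """Train a paragraph vector model.

    texts: dict mapping an identifier (fragment number or trope name,
    hypothetical layout) to its full description text.
    """
    # Imported lazily so the parameter table can be inspected without Gensim.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [
        TaggedDocument(words=text.lower().split(), tags=[key])
        for key, text in texts.items()
    ]
    # Passing the corpus to the constructor builds the vocabulary and trains.
    return Doc2Vec(corpus, **DOC2VEC_PARAMS)
```

A prompt could then be embedded with `model.infer_vector(prompt.lower().split())` and compared against the stored document vectors.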

During a performance, the current platform is used to find the next
platform or a tilt. Alternatively, instead of using the current plot fragment,
actors may also provide a custom prompt to the system to steer the plot in
a certain direction. When queried, the dAIrector returns 5 tilt options with
minimum cosine distances in vector space from the entire space of candidate
options [46], [228]. For vectors A and B the cosine distance, or angle distance
between two vectors, is defined as $d(A, B) = 1 - \frac{A \cdot B}{\|A\| \|B\|}$.

³ https://radimrehurek.com/gensim/models/doc2vec.html
⁴ https://pypi.org/project/gensim/

As noted above, we use the full text of the plot fragments, as well as the 
descriptions of the tropes to train the paragraph vector model. However, since 
these descriptions are typically several paragraphs long, we only communicate 
the trope names to the actors. While we perform all comparisons against the 
full textual description of the tropes, this description is never displayed to the 
actors. This allows us to keep the instructions from the dAIrector concise and
open-ended while providing additional, related information [288].
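
The nearest-candidate lookup can be illustrated with a few lines of plain Python. The vectors here are toy 2-dimensional lists rather than paragraph vectors, and the `best_n_match` name is borrowed from Code Sample 6.1 with an assumed signature.

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 - (a . b) / (|a| |b|). Undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def best_n_match(query, candidates, n=5):
    """Names of the n candidates with the smallest cosine distance to query.

    candidates: dict mapping a name to its vector (hypothetical layout).
    """
    ranked = sorted(candidates, key=lambda name: cosine_distance(query, candidates[name]))
    return ranked[:n]

toy_vectors = {"same": [1.0, 0.0], "orthogonal": [0.0, 1.0], "opposite": [-1.0, 0.0]}
closest = best_n_match([2.0, 0.0], toy_vectors, n=2)
```

Because the distance depends only on the angle between vectors, the query `[2.0, 0.0]` is at distance 0 from `[1.0, 0.0]` despite the different lengths.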


6.3.4 Plot Tree Generation 


Rather than generating a single plot from the Plotto graph we generate a tree, 
with one start node chosen randomly and all successors as children. Each of 
these children has its successors as children up to a configurable depth. By 
default, the platform used to determine the next scene is the current plot 
fragment, but actors may prompt the system for a plot fragment that aligns 
more closely with their interpretation of a scene or details which arose from 
the scene improvisation. A performance of such a plot tree starts at the root 
node, and proceeds down the tree, where child nodes are chosen depending on 
the prompts given by the actors. 
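
A depth-limited expansion of the graph into a plot tree might look like the sketch below. The successor table and the nested-dictionary tree layout are hypothetical; only the idea of expanding all successors up to a configurable depth is taken from the text.

```python
def build_tree(graph, node, depth):
    """Expand all successors of node, recursively, down to depth levels.

    graph: dict mapping a node id to a list of successor ids.
    The depth limit also guarantees termination when the graph has cycles.
    """
    children = []
    if depth > 0:
        children = [build_tree(graph, nxt, depth - 1) for nxt in graph.get(node, [])]
    return {"id": node, "children": children}

def count_nodes(tree):
    """Total number of nodes in the tree, including the root."""
    return 1 + sum(count_nodes(child) for child in tree["children"])

# Toy successor table containing a cycle (start -> left -> start).
SUCCESSORS = {"start": ["left", "right"], "left": ["start"], "right": []}
tree = build_tree(SUCCESSORS, "start", depth=2)
```

During a performance, moving to the next platform then amounts to choosing one entry of `children` at the current node.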

For example, the plot fragment Carl’s friends, Doug and Fred, believe that 
Lisa, whom Carl is about to marry, is a woman of immoral character, has 
two successors: Carl seeks to free himself from certain meddlesome influences 
and Lisa, harassed by gossip that reflects on her integrity, seeks deliverance 
from false suspicion. Depending on which aspect of the original plot the
actors decide to focus on, the interference of Doug and Fred in Carl’s affairs
or the rumours that Lisa is of immoral character, one or the other successor
is better suited. Note that neither successor is completely unsuitable in any
case, meaning that the platform can provide guidance for the system, but that
guidance does not necessarily have to be followed.


6.3.5 Contextual Tilts


In addition to simply traversing the plot tree, the system can also provide tilts 
in the form of plot tropes obtained from TV Tropes. In this case, the platform 
is used to find a selection of tropes that fit best with the current plot fragment. 
Tilts provide additional information. Thus, rather than returning the single
trope that is closest to the platform, our system computes the 5 most
semantically similar tropes and then returns a random sample from the related set.

Prior to sampling, tropes are filtered and excluded if they only
provide redundant information. For example, if the platform is Albert, an
inefficient, futile sort of person, comes to believe that he is the
reincarnation of Nicola Tesla, the best fitting trope according to our system
is Reincarnation, which does not provide any additional information. However,
the trope that has the second lowest distance is Deal With The Devil, which
provides additional guidance for the actors. In a performance, the actors can
utilize this, for example, to narrow down how Albert came to his belief, by
making a deal with the devil. This provides the scene with additional
directions to explore, but it is up to the actors to decide when they would
prefer a tilt for additional guidance and when they want to follow their own
impulse for where the scene should go.

To eliminate redundant tropes, we compute the intersection of words
with more than 3 letters (to exclude articles, pronouns, etc.) between each
trope and the platform and discard any tropes for which this intersection is
non-empty. In the example above, because the word reincarnation is also part
of the platform, this particular tilt would not be provided by the system.
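
The redundancy filter and the random choice among the remaining tropes can be sketched as follows. The sketch assumes the word intersection is computed against the trope name (the text does not say whether the name or the full description is used) and that filtering is applied to the 5 closest tropes; all helper names are hypothetical.

```python
import random
import re

def content_words(text):
    """Lower-cased alphabetic words with more than 3 letters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}

def filter_redundant(platform, trope_names):
    """Drop tropes that share any content word with the platform."""
    platform_words = content_words(platform)
    return [t for t in trope_names if not (content_words(t) & platform_words)]

def pick_tilt(platform, ranked_tropes, rng, n=5):
    """Sample one non-redundant trope from the n closest (ranked) tropes."""
    pool = filter_redundant(platform, ranked_tropes[:n])
    return rng.choice(pool) if pool else None

PLATFORM = ("Albert, an inefficient, futile sort of person, comes to believe "
            "that he is the reincarnation of Nicola Tesla")
RANKED = ["Reincarnation", "Deal With The Devil"]  # closest first, as in the text

kept = filter_redundant(PLATFORM, RANKED)
tilt = pick_tilt(PLATFORM, RANKED, random.Random(0))
```

The length threshold is a crude stop-word filter: four-letter function words such as "with" or "that" still count as content words, so a trope can be discarded for sharing one of them with the platform.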


6.3.6 Stage Presence 


For a performance, the system provides output in the form of platform beats 
and tilts, according to prompts given by the actors. The pseudo-code of this
process is as follows:
# Get a starting platform
platform = root(plot_tree)
# Display the platform
present(platform)
while platform:
    # Get input from the human
    request, prompt = get_input()
    # Context is the platform or the user's prompt
    context = platform
    if prompt:
        context = prompt
    # Return the next platform beat or a tilt
    if request == "platform":
        platform = best_match(context, children(platform))
        present(platform)
    elif request == "tilt":
        tilt = random(best_n_match(context, tvtropes, n))
        present(tilt)


Code Sample 6.1: Pseudo-code for dAIrector plot generation. 

When the actors request either a platform beat or a tilt, the system uses 
the paragraph vector model to find the best match among all candidates for 
the given prompt, which defaults to the last presented platform beat. For the 
next platform the candidates are the Plotto plot fragments that are children 
in the provided plot tree, while for tilts the candidates are all TVTropes plot 
tropes. 
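As a rough illustration of the matching step, the sketch below ranks candidates by cosine distance to the context. It assumes that the context and every candidate have already been embedded as non-zero vectors by the paragraph vector model; the embedding step itself is not shown, and the function names simply mirror those in the pseudo-code rather than the actual implementation.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def best_n_match(context_vec, candidates, n=5):
    """Return the names of the n candidates closest to the context.

    `candidates` maps a candidate name (plot fragment or trope) to its vector.
    """
    ranked = sorted(
        candidates,
        key=lambda name: cosine_distance(context_vec, candidates[name]),
    )
    return ranked[:n]


def best_match(context_vec, candidates):
    """Return the single closest candidate."""
    return best_n_match(context_vec, candidates, n=1)[0]
```

For the next platform beat, `candidates` would hold only the children of the current node in the plot tree, while for a tilt it would hold all plot tropes.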

The basic mode of interaction with our system is through a console-based 
application. This application will present the plot fragments in order, and can 
be prompted for the next platform or a tilt. For a live performance, this mode 
is less convenient and therefore we also provide the capabilities for speech 
input and output, realized through speech-to-text using pocketsphinx [130]
and text-to-speech using the built-in say operating system command. Using 
this interaction mode, the system reads plot fragments and tilts out loud, 
and the next platform beat or tilt can be obtained by the actors saying the 
corresponding keywords. 

Plot fragments, as contained in Plotto, refer to characters in the story using
codes, including A for the main male character, B for the main female
character, but also very specific codes such as AUX for a fictitious aunt. As
part of presenting the plot fragments to the actors and audience, our system
replaces these codes with consistent character names. This is controlled via a
configuration file, with default names provided by the system. For clarity, we
present all plot fragments in this paper with names replaced. Note that at this
point we do not change pronouns if character symbols are replaced since that
would require automatic identification of the referent of each pronoun, which
is outside the scope of this work.
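A minimal sketch of this substitution step is shown below. The mapping and the function name are hypothetical stand-ins for the configuration file and the system's own code; only the codes A, B, and AUX are taken from the text, and the default names are illustrative.

```python
import re

# Hypothetical defaults; in the system these come from a configuration file.
DEFAULT_NAMES = {"A": "Alfred", "B": "Beatrice", "AUX": "Aunt Agatha"}


def replace_codes(fragment, names=DEFAULT_NAMES):
    """Replace Plotto character codes in `fragment` with character names.

    Codes are matched as whole words only. Pronouns are left untouched,
    and an English article "A" would also be replaced, so this sketch
    assumes that a standalone "A" in a fragment is always a character code.
    """
    # Try longer codes first so that multi-letter codes take precedence.
    codes = sorted(names, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, codes)) + r")\b")
    return pattern.sub(lambda match: names[match.group(1)], fragment)
```

Because the same mapping is applied to every fragment, a character keeps the same name across all platform beats of a story.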


6.4 Evaluation 


Given the unique environment of an improv theatre performance, evaluating 
the quality of output is challenging; there is often no catastrophically “wrong” 
output [212]. That said, given the platform, some tilts will require significantly
more justification to produce a satisfying narrative [288]. For our work, the 
main challenge lies with evaluating the platforms and tilts of story fragments. 

Having humans annotate multiple story fragments with the best fitting 
trope is a challenging multi-class classification problem. For example, the 
plot fragment Joe, becoming aware of an old prophecy, unconsciously seeks to 
become like the exalted protagonist of the prophecy could be seen as having any 
of the tropes Prophecy Twist, Self Fulfilling Prophecy, The Unchosen One, or 
Because Destiny Says So (among others) as the “correct” fit. For this task, we
report how well our system reproduces human-assigned tilts on our test set.


6.4.1 Evaluation of Tilts 


To test the key functionality of dAIrector, that of selecting a best-fitting tilt
given a plot fragment, we designed a simple task. First, we generated a dataset of
100 clean, labelled pairs of tilts and plot fragments. We split this dataset into
training and testing sets. For evaluation, we sample a random plot fragment from the
test set and the task for the system is to correctly predict the associated
tilt. Given a plot fragment, our model returns the 5 candidate tilts with the
minimum cosine distance. We evaluate the system based on top-5 accuracy.
Stated explicitly: given a plot fragment, how likely is it that the
associated tilt is among the 5 results returned by the system? This number is
reported over all 20 examples in the test set.
One insight gained from this approach was that while the tropes produced
by the system are usually related to the text fragment, there are several tropes
that are vague and apply to many scenarios, while others, which are often more
closely related to the story fragment at hand, are more specific. For example,
in our test set, the trope “Much Ado about Nothing”, which is a generic trope
about love, applies to a wide variety of plot fragments. The trope that was
selected the second most often for our test set was “Road Trip”, which applies
to a wide variety of travel-related scenarios.

The dAIrector returns a random sample from the five most closely related
tilts to a given plot point. It is therefore also reasonable to use the top-5 error 
as a measure of quality rather than the top-1 error. Top-N error is a common 
error metric for classification tasks and measures how often the target class 
does not show up in the top-N classes assigned to a test example. 
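To make the metric concrete, the following sketch computes the top-N error from ranked predictions. The function name and data layout are assumptions made for illustration; they are not taken from the evaluation code used in this chapter.

```python
def top_n_error(predictions, targets, n=5):
    """Fraction of examples whose target is missing from the top-n predictions.

    `predictions` holds one ranked list of candidate labels per example
    (best candidate first); `targets` holds the annotated label per example.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one example is required")
    misses = sum(
        1 for ranked, target in zip(predictions, targets) if target not in ranked[:n]
    )
    return misses / len(targets)
```

Top-5 accuracy, as used in the task description above, is one minus the top-5 error.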

Even so, the top-5 error on our test set is 40%, while the top-1 error rate is
66%. While these error rates are high, the trope annotation task involved many
arbitrary choices by human annotators. Most likely this is due to there not being
a clear best trope, and human annotators being overwhelmed by the number of
possible, subtly different, tropes.

Our plot trope set contained the 700 tropes tagged as plot tropes from the
total set of 4300 tropes on TVTropes
(https://tvtropes.org/pmwiki/pmwiki.php/Main/Plots). For example, the story fragment
Alfred is thrown into prison through false evidence in a political conspiracy
was assigned the trope Get Into Jail Free by a human annotator, but our
system returned Clear Their Name, Mystery Literature, No Mere Windmill,
Lipstick Mark, Prison Riot as the top five tropes, all of which could also be
deemed applicable.

Note the difference in specificity between Mystery Literature and Lipstick
Mark, where the latter provides a lot more detail to the actors of how to
proceed with the scene. At present, our system treats all tropes as equally
applicable, but, as noted above, some tropes are more general and thus related
to the story being presented while actually adding less detail than others.
While very specific, the definition of the trope Get Into Jail Free actually refers
to a character that wants to get arrested intentionally, which is arguably less
fitting with the given sentence, but demonstrates the challenges faced by the
human annotators.

As a way to quantify this discrepancy, we used information contained within
the TV Tropes graph. Tropes link to one another if they share some
commonality. We used these links to calculate a distance between the tilts our
system generated and those assigned by the human annotators, equal to the number
of links between two tropes.

For example, Get Into Jail Free links to Can’t Get in Trouble for Nuthin’, which
links to Frame-Up as a reason for the arrest. That trope links to Clear Their
Name as a way to resolve the situation, resulting in a distance of 3 between
Get Into Jail Free and Clear Their Name.
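This link distance can be computed with a breadth-first search over the trope graph. The sketch below is an illustration under stated assumptions: the graph is a toy adjacency mapping containing only the links from the example above, links are followed in the direction they are listed, and the function name is hypothetical.

```python
from collections import deque

# Toy adjacency mapping with only the links mentioned in the example.
TOY_GRAPH = {
    "Get Into Jail Free": ["Can't Get in Trouble for Nuthin'"],
    "Can't Get in Trouble for Nuthin'": ["Frame-Up"],
    "Frame-Up": ["Clear Their Name"],
}


def link_distance(graph, source, target):
    """Number of links on the shortest path from source to target, or None."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # target cannot be reached from source
```

On the toy graph, `link_distance(TOY_GRAPH, "Get Into Jail Free", "Clear Their Name")` returns 3, matching the example. If links were instead treated as undirected, each edge would need to be added in both directions before searching.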

Over the entire TV Tropes set, the median distance between two tropes is
3 (mean: 3.1, stddev: 0.6). This is not unexpected, since tropes often refer to
“supertropes”, which refer to other “supertropes”, from which the target trope
can be reached. However, the median distance for tropes given as tilts by our
system, excluding those that exactly matched the human annotation, was 2
(mean: 2.5, stddev: 0.7), which is typically a connection via the “supertrope”
common to the two tropes.


6.4.2 Sample Stories 


To better illustrate the output from our system, we present sample stories pro- 
duced by the system along with tilts the actors might be given. These stories 
demonstrate how the produced outputs are coherent stories. The coherency 
is a product of the structure employed by Plotto. Additionally, we show ex- 
amples of stories to illustrate that the tilts the system selects can be used to 
refine, provide background information, or drive the story in different direc- 
tions. To demonstrate our system, we set the maximum length of each story 
to 5. Shorter stories tend not to have enough happening in them to qualify 
them as stories, while longer stories, owing to the structure of the book, start 
to meander to different, somewhat unrelated story lines. 


The first example is a well-structured narrative with a beginning, middle,
and end, consisting of the following platform beats:

1. Lana, a person influenced by obligation, falls into 
misfortune through mistaken judgement. 

2. Lana, in order to be revenged upon her enemy, Mr. Kyle, 
manufactures an infernal machine, BLOB. 

3. Lana, influenced by a compelling idea of responsibility, 
finds it necessary to protect her friend, Tynan, from a
secret danger. 

4. Lana, suspected of treachery by her friend, Tynan, in a 
daring rescue, saves the property and perhaps the life of 
Tynan, and proves her faithfulness by a revelation of the 
danger to which Tynan, unknown to himself, was exposed. 

5. Lana seeks to correct a character weakness in her friend, 
Tynan. 

6. Lana achieves success and happiness in a hard undertaking. 

In each scene, the platform is clear and evident. In addition to the platform 
of the scene and the dialogue from the improvisors, the actors might desire a 
plot device to instigate or inspire the action. For instance, the “secret danger” 
referenced in Scene 3 is vague, and the actors might ask the system for a tilt. 
One such applicable tilt returned by our system, It Belongs in a Museum, 
provides context to further refine the “secret danger”. 

Plot fragments present in Plotto rarely mention time passing, and it is 
often up to the actors to explain jumps in time. Consider this example of a 
sequence of platform beats from our system: 

1. Alfred, in order to restore to Beatrice, without a 
confession of culpability, wealth of which he has secretly 
defrauded her, marries her. 

2. Alfred seeks to escape difficulties, restore property 
and be free of an unloved wife, Beatrice, all by secret 
enterprise. 

3. Alfred leaves his coat on a cliff at the seaside, drops his
hat in a stunted tree below the brink, and vanishes from
the scenes that know him.

4. Alfred, under a fictitious name, returns to his native 
place, where he had committed a youthful transgression, 
and, as an Unknown, seeks to discover whether his youthful 
escapades have been forgotten and forgiven. Also, he 
wishes to make reparation in an assumed character for wrong 
done in his true character. 

5. Alfred, returning as an Unknown to his native place, 
discovers no one recognizes him. 

Alfred disappears in scene 3, and then reappears, apparently much
later, in scene 4 to wrap up his transgressions from the earlier scenes. We
highlight this story because, when the system was asked for a tilt on scene
3, it responded with the tilt Tailor Made Prison, which seemed unfitting at
first. However, upon reflection, the references to the coat and hat link it to the
concept of a tailor. This is remarkable for two reasons: 1) it is probably not a
connection that would occur immediately to a human, and 2) this can be seen
as a pun that works well within the context of our domain.


6.4.3 Qualitative Evaluation by Professional Improvisor 


To investigate the quality of the system we instructed a professional
improvisor (fluent, native English speaker, improvisor with more than 10 years of
performance experience) to interact with the dAIrector. The performer was given an
introduction to the system, and then explored the interaction over the course 
of several scenes. The performer discussed their impressions during the in- 
teraction. We summarize the interaction feedback below by including quotes 
from the performer. By analyzing expert opinion we can formulate conclusions 
about the effectiveness of these methods. For several points of feedback, we 
directly address the quotes inline. 
• There is a real fun in getting yourself into trouble and then putting your
faith in the dAIrector to do something to help. In improvisation these
are described as “What Should I Do?” moments, when the improvisor
decides to prompt the system for the next plot point or tilt.

• It doesn’t know what I want the scene to be about or what decisions I
make. This is an area for future work focused on how the actor’s dialogue
and actions are incorporated as prompts for the system.

• Justification is natural, and it is natural to make leaping assumptions to
connect actions/intentions to characters as the plot generation system
did not inherently make those assignments.

• Tilts don’t over-complicate the narrative, it expands the story rather than
advancing the plot, it adds flavour. The tilt is not always necessary, and
making it optional is suitable in improvisation. That said, I prefer to
use the system with the contextual tilts. They run the risk of throwing a
curve-ball that is very difficult, but they are often the sort of thing that
would be very fun to play. Tilts are a fun expansion.

• I prefer being able to choose when the plot points and hints come.


6.5 Discussion and Conclusion 


One of the limitations of our approach is that in our graph representation of 
domains all nodes are created equal, even when the underlying data sets might 
have additional information attached to them. For example, the tropes in the 
TV Tropes data set actually frequently refer to “subtropes”, “supertropes” or 
even state “contrast with” or “opposite of” in relation to another trope. Our 
system often returns very broad tilts such as Mystery Literature, or overly 
specific tilts such as Lipstick Mark, without any means to control which one to 
get. However, utilizing the structural information contained within the data 
set could lead to tilts that are better suited for any application. Extracting 
this structural information is non-trivial as it is not structured meta-data. 
Additionally, while TV Tropes is a useful resource, it is a large dataset that 
suffers from common dataset quality and inconsistency issues. One way to ad- 
dress these limitations would be to use a subset of tropes that are particularly 
narrative building. This would require filtering based on a heuristic (learned
or pre-defined) which can classify and rank tropes based on narrative-building
qualities. Some of these qualities could be information about the universe
gained through introduction of the trope, or dynamic shifts between characters.

We used Plotto and TV Tropes as our data sources because they cover a 
wide range of different narratives. It would also be possible to expand the 
dAIrector to use more specialized databases such as DramaBank [78]. By
designing a structured graph of information, textual plots of TV show episodes 
could constitute an interesting data source as well. This could allow for plots 
which extend over seasons, character arcs, individual episodes, or scenes within
an episode. 

Treating the plot fragments as a graph allows us to use the story-generating 
walk for other data sources which can be represented as graphs. We are also 
considering a possible application outside of narratives: the directed explo- 
ration of large graphs, such as Wikipedia for knowledge synthesis. In this 
scenario we could target users browsing a certain topic, starting from an article of
interest. The user could then request linked articles, constrained to articles 
semantically related to a query from a different data source. 

This chapter provides a case-study in structured plot generation by fus- 
ing knowledge in multiple graphs. Structured story generation is an ongoing 
field of research [81], [355], [358]. Human writers often write a coarse outline, 
and then focus on details in each section [312]. By modelling sequential story 
generation problems at multiple levels of abstraction (i.e. words versus narra- 
tive plot points), hierarchical models can specialize, then combine to generate 
complete stories [80]. The next chapter discusses additional future directions
for human-machine narrative generation.
Chapter 7


Discussion 


Technology made it easy for us to stay in touch while keeping a distance, ‘til
we just stayed distant and never touched. Now all we do is text too much.


— Sage Francis, 2010 


Improvisational theatre performances connecting humans and intelligent, 
interactive machines serve as an exciting, illuminating, and challenging domain 
to study conversational dialogue systems. In this chapter, we discuss the 
impact of the work in this thesis and how the fields of interactive machine 
learning (IML), conversational dialogue systems, and improvisational theatre 
have evolved. With rapid advances in machine learning technology, and an 
open and sharing scientific community, there has been significant growth at 
the intersection of these fields. This chapter presents analysis of how artificial 
improvisation has been received by performers, audiences, reviewers, and the 
public media. As artificial improvisation is a novel art-form, it is important 
to reflect on how it is consumed and understood by the public. Finally, we
discuss several research directions which will advance this intersectional field
in the future. These are less predictions of what will happen next
than prescriptions of work that could be done in an effort to address some
of the toughest ongoing challenges.

In this thesis, we presented the grand challenge problem of artificial
improvisation, or using interactive machine learning in the domain of
improvisational theatre alongside a conversational dialogue system. This is the ideal
domain for studying some of the most difficult problems in human-machine
interactions. It provides for real-time interaction between humans and
machines, with observers, in the safe, controlled environment of the theatre. We
then presented two artificial improvisors. First, we presented Pyggy, a
rule-based chatbot which used human interaction training logs to respond to user
interaction and used a projection-based embodiment. Then, we presented
A.L.Ex., the Artificial Language Experiment, which aimed to address several
of the challenges encountered when performing alongside Pyggy. From there,
we presented the show/experiment Improbotics, exploring how audiences
react to humans delivering lines from A.L.Ex. or other humans. Finally, we
presented methods for addressing several challenges in engaging interactive
narrative generation. Shaping the Narrative Arc aims to build an underlying
belief structure of the scene by modelling the co-constructed universe, and
dAIrector aims to merge knowledge from related graphs to form coherent plot
structures to frame improvisation. These works represent a collective research
push to progress improvisation toward natural human performance alongside
these artificial improvisors and to progress the underlying machine learning
technology by incorporating human interaction.


7.1 Human-Machine Improvisation 


How can the robot and the human interact physically on stage? Background 
on this intersection is covered in Section 2.3.2. Since the inception of this 
thesis work there has been an evolution of the research field and the art-form. 

We have invited more machine learning systems into our lives over the
last half decade than in all previous years. Devices such as Amazon Alexa or
Google Home are common; these systems offer conversational interaction
triggered by a simple “Hey, let’s have a chat!” By inviting these devices into
the home, the car, and the office, we are able to perform a wide variety of
tasks faster and more easily than before. This ubiquity allows companies to build
conversational datasets at rates previously unthinkable. This interaction has
been fueled, in large part, by advances in labs around the world
and the collaboration of hardware and software providers to bring costs down
to consumer levels.

There are many games, applications, and interactive experiences available
on these devices which allow for conversational interaction, including chit-chat
dialogue (see Section B.7). This also means that the expectations of
users have increased. Users expect engaging, delightful conversations to be
available over a near-endless set of topics and dynamics. This work has in
large part been supported by conversational researchers adding to the field
of conversational interaction. That said, the influence of improvisation in
modelling these dialogue improvements cannot be overstated. It is not only
the work in this thesis that uses rules from improvisation to model improved
open conversation [176], [206], [207].

The art-form of modern performative improvisational theatre is still young,
and has itself experienced significant growth over the years spanned by the work
in this thesis. Online streaming venues such as Twitch and YouTube
Live have brought broadcast live improvisation, comedy, and theatre to larger
audiences and have started to integrate user interaction in the performances.
Digital puppetry with avatar representations of humans, or v-tubers, is
becoming increasingly popular. These innovations are not the only way that
technology has been integrated into improvisation. Shows in Canada and Germany
have started to involve wireless networks for users to connect to before, during,
and after the show to collaborate and share information.¹ Additionally, use
of generative models has significantly influenced musical improvisation
performance [125]. Music generation models from Google’s Magenta were used in a
June 2018 performance of Improbotics (see Chapter 4) in Edmonton, Canada.
Similarly, these models have been used in musical writing and performance by
Pablo Castro in collaboration with David Usher [45].² Generative
models also continue to be used extensively by the comedy collective Botnik in
live comedy and musical performance.³

¹ http://hipbang.ca/
² https://goo.gl/QmYFQ4
³ https://botnik.org/
7.2 Critical Reception


Artificial improvisation is an art-form and science experiment that the general
public are eager and willing to engage with. The evolution of these
technologies from experimental to production-ready has allowed for rapid integration
by artists and directors. That said, this integration of technology has both
lowlights and highlights. By consistently and deliberately reflecting on
reviews and commentary on these performances we aim to grow and evolve the
technology powering the performance and the show itself. Thus, we present
reviews and critical reception of the technology and show below with
humility and much appreciation for the hard work and expertise of the writers and
reviewers.


7.2.1 From Audience and Improvisors 


One of the best means of evaluating a creative collaborator is through qualitative
responses from performers and audiences. Toward that end, the response
from improvisors has been strong and supportive. Several comment themes 
have been synthesized in Chapter 4 from performers who have worked along- 
side the artificial improvisor. Briefly, these comments fall into several major 
themes: 

1. Artificial improvisors cannot tell complete stories. 

2. Improvising with the artificial improvisors is more work. 

3. Performing with these systems forces you to be a better improvisor. 

4. It is like performing with a novice improvisor. 

5. There is fun in getting yourself into trouble, and then putting your faith 
in the system to do something to help. 

In addition to these comments, the system has been studied by professional
improvisor Rachel Rosenthal, who reflected on whether or not these systems
threaten her livelihood and how they inspire other artists.⁴ In her article, “Will
robots replace improv comedians?” she presents a view which is initially very
skeptical but eventually supportive after exploring several questions
comparing human and machine improvisors. She emphasizes that while the robot
cannot emote, a critical skill for an improvisor, it does draw empathy from human
audiences. This reinforces our choices of humanoid embodiments which
capture those feelings of connection effectively. She concludes by reiterating her
fascination with the system and desire to play alongside it in the future: “I
definitely at least want to try improvising with A.L.Ex at some point. I mean
I have to, right? (Of course, this is how it all starts...)”. These comments are
reassuring to us: we have sparked the curiosity of expert performers
at the top of the art-form. Artificial improvisation is compelling to these
performers for how it challenges possibilities and inspires expert storytellers
to tell new and interesting stories.

⁴ https://goo.gl/V6CsR4


7.2.2 From Reviewers 


Performances of the artificial improvisation work described in this thesis have 
been reviewed by theatre reviewers over multiple shows. These reflections 
provide an honest assessment of the art-form’s impact on society in relation 
to other contemporary performances. As developers, we have embraced the 
feedback and learned from these reactions to evolve and improve the show. 
One point of interest is tracking how the reaction to the show has evolved
over the years, likely due to the show’s evolution and the increasing influence
of artificial intelligence in popular culture. Several excerpts are included below
for reference.

1. Edmonton Fringe review of “Artificial Intelligence Improvisation”, by 
Josh Marcellin, 18 August 2018. “It’s almost as much fun hearing 
the researcher-performers explain the science behind A.L.Ex with undi- 
luted glee as it is watching the AI come up with absurdly, often dark, 
non-sequiturs for the humans to riff on. Boundary-pushing science plus 
talented meat-bags equals a stand-out improv Fringe experience.” 

2. Broadway Baby’s review of “Artificial Intelligence Improvisation”, by
Alex McCord, 5 June 2018. “[They] create a unique comedy experience
with their machine companion. Veering wildly between moments
of cringe-inducing strangeness and total hilarity, it is a show that must
be seen to be understood. It sounds exceptionally strange. It is
exceptionally strange... The genius of the robot is in the way it addresses
the flaws of improv comedies and gives them a reason to go to weird
places and say weird things. One brilliant set piece is when they put the
robot avatar aside and bring up an audience member and make them
say whatever dialogue the program generates... [T]he idea here is
magnificent. There are so many laughs to be found with this formula and,
with improvements both from our scientists and A.L.Ex, this could be
revolutionary.”

3. “Artificial Intelligence Improvisation at the Etcetera Theatre”, View 
from the gods, 31 July 2017. “The most fascinating aspect of Artificial 
Intelligence Improvisation is that despite the characters being make be- 
lieve, the technology behind this show is genuine... Gimmicky? Maybe 
a little. However it’s also very cool and a lot of fun to observe... they’re 
very well received by an enthusiastic audience ... Slightly ramshackle 
and prone to system error, but thoroughly enjoyable.” 

4. “Artificial Intelligence Improvisation”, by Kate Pettigrew for London
Pub Theatres, 1 August 2017. “It is always interesting when something
different is tackled in theatre and Artificial Intelligence Improvisation
does just that.”


7.2.3 From News Media


Popular media’s coverage of artificial improvisation is a testament to both
its novelty and impact on society. To date, there has been extensive media
coverage of different angles of the work from news outlets including Bloomberg,
the Wall Street Journal, and the New York Times. A complete list is included
in Appendix G. The articles fall into four major thematic groups:

1. Introductions to the technology 

2. Putting AI to the test with humour or comedy 

3. Teaching comedy to machines 

4. A robot tricking people 


Introductions to the technology are pieces which are related to the science
behind the system. These articles introduce machine learning and often profile
specific underlying technologies. They serve as vehicles for scientific
communication: a means by which we as scientists and researchers can explain
our pursuits in a manner that is not mired in academic language. These articles
aim to serve the scientist and the general public audience.

Several articles take the angle of putting artificial intelligence to the test.
These articles often make an assumption that machines will compete with
humans on specific tasks. One task is improvised theatre, and these articles
explore the notion of the robot’s ability to perform that task. If the robot can
start to make jokes, what could it do next? These articles present
the research in a competitive fashion and serve a larger fear-based
narrative of humans competing with machines, as discussed in Section 2.3.2.

In the collection of news media, some of the press has taken the angle of 
teaching comedy to machines. These pieces often lay out machine learning in 
simple terms and then present comedy, humour, or wit as another technique 
or skill that can be taught to machines. While these articles may serve the
underlying science, they often include details on pattern recognition or statistical
inference. They do not make the connection that the learning they are discussing
is often finding patterns in input data and generating a corresponding, high
probability output. While this would be a more difficult story to tell, it would 
help to bridge the gap between human and machine learning. 

Finally, there is press which homes in on a specific part of our work, that
of deception. These pieces often use engaging language to capture attention.
They focus on how the machine is doing its best to trick human audiences,
or attempting to fool observers. Many of the pieces start from the inspiration of
the Imitation Game, and then spiral toward deceit of human trust. We feel
that this media does not serve the science, nor does it serve the general
public. We can imagine that these sorts of stories are used by news media to
generate interest. Compelling stories of humans tricked by machines capture
attention, but they are only thematically related to this work.
7.3 Future Directions


Artificial improvisation starts with dialogue, but that is not where it ends. 
Mastering the interactive capabilities of artificial improvisation requires a 
closeness, similar to the relationship between a musician and their instru- 
ment [331]. Intimate, dialogue-driven, relational closeness with these systems 
can create augmentative solo performers. These performers could simultane- 
ously affect all facets of stage productions. The artificial improvisor could send 
control signals to lighting, sound, and video projection systems. Projection 
was a first step with Pyggy, and we have experimented with incorporating 
images retrieved based on queries related to topics from the scene [352]. This 
represents a significant engineering effort that will be possible to produce in 
the near future. There is still much ground to be covered by using simple
language models and searching for nearest neighbours in a latent sentence
embedding space. Our open source release, jann, makes it easy for users to
build and deploy a dialogue system using Google’s DialogFlow to allow
rapid integration over Facebook, Twitter, Slack, and Google Home. We pre- 
dict that these tools will allow many more interactive dialogue applications to 
be deployed for large populations of users. 
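
The nearest neighbour idea can be illustrated with a minimal sketch. It assumes a toy bag-of-words vector in place of a learned sentence encoder, and the prompt and response pairs are hypothetical; it is not the jann implementation.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(sentence.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical (prompt, response) pairs standing in for a dialogue corpus.
PAIRS = [
    ("where are we going", "To the old lighthouse, captain."),
    ("what is your name", "They call me the night watchman."),
    ("i brought you a gift", "You remembered my birthday!"),
]


def respond(line, pairs=PAIRS):
    """Return the response whose stored prompt is nearest to the input line."""
    query = embed(line)
    _, response = max(pairs, key=lambda pair: cosine(query, embed(pair[0])))
    return response


print(respond("what is your name again"))
```

With a learned encoder, only `embed` would change; the retrieval step stays a similarity search over stored prompts.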

As discussed in Section 4.6.4, we expect timing (the number one rule
of comedy) to be improved by future iterations of these systems. Furthermore,
there are several interesting future studies to explore extending on the work
presented in this thesis. Below we present several studies of interest which will
support the continued growth of the field.


7.3.1 Improv in Any Language 


With the increasing ease of travel between nations, and improvements in re- 
mote video conference abilities, connections are made between performers and 
audiences of different languages. Theatre is one of the most important tools
we have for sharing experiences and building cross-cultural understanding.





jann: https://github.com/korymath/jann
DialogFlow: https://dialogflow.com/
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Improvised theatre is positioned to connect culture given its accessibility, uni- 
versality, and low barriers to entry; improv requires no scenography, costumes, 
lights, or sound equipment. Improv can be performed by individuals with lit- 
tle or no previous experience. Although the quality of performance may vary 
given performers of differing skill levels, basic improvisation can be easily
understood and internalized. In a short amount of time, individuals from diverse
cultures can perform scenes together with diverse characters, relationships, 
settings, motivations, and even subtext. This can be done even when the 
performers do not share a common spoken language. 

The art of improvisation is derived from the connections between performative
layers, both between the performers and between the performers and the
audience. Improv embraces the audience to create collaboratively. In this
way improvisation is a democratic narrative. The potential impact of
improvised theatre between performers and audiences of different cultures
and varying experience levels is significant. Most international improvisational
collaboration is English based, but many regional festivals take place in the 
language native to the host country. This excludes audiences without knowl- 
edge of the performance language, and limits the contributions by performers 
who do not speak the language. Without advanced translation technology, 
improvisation can lose important voices due to language limitations. 

How can we create conditions so that improvisors from different cultures 
can improvise together in their own language? How can audiences understand 
performers using diverse languages? The techniques used for artificial improvi- 
sation can be translated, literally and figuratively, to address these questions. 
By connecting all performers with microphones and headphones, audio signals 
can be routed through computer systems. Speech recognition can be used to 
recognize the language of the spoken line. If the text is to be translated, it can 
be sent to a translation service (e.g. Google Translate API). The translated 
text can then be synthesized and the audio can be sent to the speakers in the 
theatre, to the headphones of select performers, or to select audience members. 
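
The routing described above can be sketched as follows. The recognizer, translator, and synthesizer are hypothetical stubs (a real system would call external speech and translation services), and the `Listener` type and phrase table are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Listener:
    name: str      # performer or audience member
    language: str  # preferred language code, e.g. "en" or "fr"


# Tiny phrase table standing in for a real translation service.
PHRASES = {
    ("fr", "en"): {"bonjour": "hello"},
    ("en", "fr"): {"hello": "bonjour"},
}


def recognize_speech(audio):
    """Stub recognizer: here the 'audio' is already a (text, language) pair."""
    return audio


def translate(text, source, target):
    """Stub translator: use the phrase table, else pass the text through."""
    return PHRASES.get((source, target), {}).get(text, text)


def synthesize(text, language):
    """Stub synthesizer: return a label in place of an audio buffer."""
    return f"<audio:{language}:{text}>"


def route_line(audio, listeners):
    """Recognize one spoken line and build per-listener translated audio."""
    text, source = recognize_speech(audio)
    outputs = {}
    for listener in listeners:
        if listener.language == source:
            continue  # no translation needed; they hear the performer directly
        translated = translate(text, source, listener.language)
        outputs[listener.name] = synthesize(translated, listener.language)
    return outputs


print(route_line(("bonjour", "fr"), [Listener("Ana", "en"), Listener("Luc", "fr")]))
```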

Timing is a critical issue for this configuration [11], [173], and sub-second
latency is important for such a system. This could require predictive text
modelling and nearest neighbour approaches for immediate response when
necessary, in addition to neural language modelling for diverse response generation.
It is important to integrate these delays into the framing of such a perfor- 
mance (similar to how the Turing test framing enabled studies in Chapter 4). 
Translation accuracy is another major challenge in this pursuit. This accuracy 
could likely be improved through contextual modelling by incorporating topic
and scene information [201], [215], as well as cultural information [35].
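
One way to respect a latency budget is to fall back to a fast retrieval response when slower generation does not finish in time. The sketch below makes that concrete under stated assumptions: `slow_generate` and `fast_retrieve` are hypothetical stand-ins, and the budget value is arbitrary.

```python
import concurrent.futures
import time


def respond_within_budget(generate, fallback, line, budget_s=0.5):
    """Return generate(line) if it finishes within budget_s, else fallback(line)."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(generate, line)
    try:
        return future.result(timeout=budget_s)
    except concurrent.futures.TimeoutError:
        # The slow call keeps running in the background; its result is dropped.
        return fallback(line)
    finally:
        executor.shutdown(wait=False)


def slow_generate(line):
    """Hypothetical neural generator that takes too long to answer."""
    time.sleep(0.3)
    return "generated: " + line


def fast_retrieve(line):
    """Hypothetical nearest neighbour lookup that answers immediately."""
    return "retrieved: " + line


print(respond_within_budget(slow_generate, fast_retrieve, "hello", budget_s=0.05))
```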


7.3.2 Starting and Finishing Conversations 


To extend the work presented in this thesis, one likely next step will be toward 
initiating and concluding conversation. Having an agent that can maintain an 
ongoing conversation is a significant step, but allowing the agent to start and 
finish conversations will require significant research progress. Through human 
studies, researchers might collect natural human dialogues [100] over a predefined
set of topics and inputs [305]. From these natural conversations, models
may be learned specifically for the initiation of a conversation. Similarly, di- 
alogues often end by recapitulating what was discussed, summarizing topical 
contributions, and then providing concluding remarks. Currently, most inter- 
actions with these dialogue agents conclude when the user decides, often once 
the system generates one-too-many uninteresting responses [180]. Through 
continued interaction with humans, these systems can build growing datasets. 
They can start to learn online and adapt in real-time by following these two
research directions.


7.3.3 Tuning Language Models with Rewards 


Current generative text models are limited to learning the statistics of the 
underlying training dataset. Fine tuning of trained generative deep neural 
networks using self-play in the domain of open dialogue conversation may 
be possible. This will allow users to define a reward scheme to personalize 
and tune models following large scale training. This tuning will allow for 
creative artistic expression through defining interesting reward schemes, such
as sarcasm, style, meter, or rhyme scheme.
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Current deep neural networks strictly model the underlying statistics of a 
training dataset. Once trained, these models are limited in their adaptability.
In image and video domains, it is common to initialize the weights of a neural 
network with those trained for a separate task and then fine tune the final 
layer with new task-specific labelled data. This pretraining and transfer is 
less common in text domains where corpus and task may vary drastically 
from domain to domain. There is likely to be a breakthrough work extending 
pretraining for text. As Ruder (2018) said, “NLP’s ImageNet moment has
arrived”. While working with fully pretrained generative language models is less
common, pretrained components are often used in the training pipeline [46],
[70]. There is also work which proposes tuning trained neural network models 
using reinforcement learning. RL has been used to tune trained models for 
music, molecule sequence, and robotic movement sequence generation [54], 
[138], [140], [179]. These models harness contextual information contained 
within human-designed reward schemes or directly from feedback from human 
observers. 

This work specifically could build on the trained models of the Artificial 
Language Experiment. A.L.Ex. is trained on a corpus of movie dialogue 
data [340], and is used as the dialogue manager in a variety of improvisational 
performance tasks as discussed previously in this thesis. A key principle in
improvisation is positivity in interlocutor (or speaker) statements; this concept
is summed up as “Yes, and...” in improvisational theory. This positivity in
statement can be measured through sentiment scoring. VADER is a lexicon and
rule-based sentiment scoring model which provides a measure of a statement’s
sentiment [95]. Preliminary work towards this sentiment
modelling is presented in Chapter 4. 

We expect that post-training tuning will allow for a designer to build arbitrary
reward schemes to specialize trained generative neural networks. Specifically,
we expect that using the sentiment score as a tuning reward during self-play
(simulated self-dialogues) will improve a model’s ability to generate novel
positive sentiment statements, and thereby will improve the model’s ability to




Ruder (2018): http://ruder.io/nlp-imagenet/
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perform improvisational theatre through conversational dialog. To evaluate 
the performance of reward tuning, researchers will need to compare the average
sentiment scores of generated conversations between tuned and un-tuned networks
across the experimental conditions. They will also need to compare downstream
tasks affected by sentiment (e.g. observer delight and participant engagement).
Additionally, optimizations might aim to maximize information
shared between interlocutors.
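
To make the reward-tuning proposal concrete, the sketch below applies a REINFORCE-style update to a softmax policy over a fixed set of candidate lines. It is a deliberately reduced illustration: the candidate lines, the word lists, and the toy `sentiment` function (a stand-in for a scorer such as VADER) are all assumptions, and a single-step bandit replaces full self-play dialogue.

```python
import math
import random

POSITIVE = {"yes", "love", "great", "wonderful"}
NEGATIVE = {"no", "hate", "terrible", "never"}

CANDIDATES = [
    "Yes, and I love that idea!",
    "No, that would never work.",
    "I suppose we could try.",
]


def sentiment(text):
    """Toy lexicon score in [-1, 1]; a stand-in for a real sentiment model."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return max(-1.0, min(1.0, score / 2.0))


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tune(candidates, reward_fn, steps=500, lr=0.5, seed=0):
    """Tune a softmax policy over candidates with REINFORCE and a moving baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(candidates)), weights=probs)[0]
        reward = reward_fn(candidates[action])
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        for i in range(len(logits)):
            # Gradient of log-probability of the sampled action w.r.t. logit i.
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)


def expected_sentiment(probs, candidates=CANDIDATES):
    """Average sentiment of lines sampled from a policy."""
    return sum(p * sentiment(c) for p, c in zip(probs, candidates))


untuned = [1.0 / len(CANDIDATES)] * len(CANDIDATES)
tuned = tune(CANDIDATES, sentiment)
print(expected_sentiment(untuned), expected_sentiment(tuned))
```

Comparing `expected_sentiment` for the tuned and un-tuned policies mirrors, in miniature, the evaluation suggested above.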


7.3.4 Adapting Language Models in Real-Time 


It is desirable for trained and tuned deep neural networks to adapt to the
emotional state of the speaker they are interacting with. Adaptation could occur
over a variety of features, including grammar, vocabulary, topic, emotional
salience, sentiment, and tone. Current deep neural networks can be trained
with a large dataset, but user-specific variation or personalization is often
difficult to achieve due to limited numbers of samples relative to the size of
the training set.

I propose that the adaptability module from the previous experiment could 
be extended with a variety of linguistic metrics to nudge the model towards 
adaptability in real-time. The chameleon model, named after the ‘chameleon
effect’, would adapt speaker style more readily than a method lacking
adaptability capacity [50].

Through the use of RL techniques from the previous experiment and build- 
ing on current speaker modelling work [364], A.L.Ex. might be adapted in 
real-time to align responses to an updating model of the other speaker in
the loop. The system could match semantic variables such as tone and emo- 
tion and maintain a separation in syntactic content. This would allow for a 
system which can continue an interesting conversation, constantly mirroring 
shifts in speaking style, while avoiding the pitfall of simply repeating what the 
other speaker said. 

To evaluate the performance of these adaptive models, it is likely that
a computational definition of what speaker modelling, mirroring, allowance
and convergence look like will be adopted [65]. For instance, a set of features
in a given piece of input text is defined, and the features of the generated
response are then compared with those of the input text [64]. This work has
proposed a large number of features over which to compare input/output pairs,
including sentence length, noun count, sentiment, and language complexity.
With these features clearly defined and computable, a model learns online to
minimize the difference between the features of the generated response and
those of the input.
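
A minimal sketch of such a feature comparison follows. The three features (token count, mean word length as a crude complexity proxy, and exclamation count as a crude tone proxy) are illustrative assumptions rather than the feature set of [64]; noun count is omitted because it requires a part-of-speech tagger. Selection among fixed candidates stands in for online learning.

```python
def features(text):
    """Map a line to (token count, mean word length, exclamation count)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w]
    count = len(words)
    mean_length = sum(len(w) for w in words) / count if count else 0.0
    return (float(count), mean_length, float(text.count("!")))


def style_distance(a, b):
    """Sum of absolute feature differences between two lines."""
    return sum(abs(x - y) for x, y in zip(features(a), features(b)))


def most_aligned(input_line, candidates):
    """Pick the candidate closest in style, excluding a verbatim echo."""
    options = [c for c in candidates if c.lower() != input_line.lower()]
    return min(options, key=lambda c: style_distance(input_line, c))


candidates = [
    "Run! Quick!",
    "Go! Now!",
    "I believe we should consider leaving this place eventually.",
]
print(most_aligned("Run! Quick!", candidates))
```

Excluding the verbatim echo reflects the goal stated earlier: mirror the speaker's style without simply repeating what they said.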


7.3.5 Evaluating Conversational Dialogue 


A good dialogue system should be capable of sustaining a coherent con- 
versation with a human by staying on topic and by following a train of 
thoughts [338]. Former measures of conversational dialogue evaluation often 
measure the capacity of a model to be consistent with a dataset (Section 2.2.7). 
Ongoing research aims to build comprehensive conversational dialogue evaluation
metrics across languages [328]. For proper evaluation, analysis should
compare how a model can provide semantically linked responses while balancing
how much it is repeating examples similar to the input data. That is,
there is a desire to measure a model’s capacity to generate unique, diverse,
meaningful responses which are close to the expected output, but perhaps not
directly aligned. In a 2018 paper, we proposed semantic similarity and response
echo index to address this exact evaluation [73]. Semantic similarity measures
the capability of a model to be consistent with the semantics of the input
context and message, thereby maintaining topical consistency. To render the
semantic representation of an utterance, we leverage Universal Sentence En- 
coder [46] wherein a sentence is projected to a fixed dimensional embedding 
vector. Other vector based sentence embeddings may provide complementary
information for this evaluation metric [70], [156]. Response echo index assesses 
how well a model is able to generate plausible responses measurably distant 
from the input dataset. Used together, these metrics can be used to reduce the 
burden on human evaluators. We compared a novel conversational language 
model (THRED: Topical Hierarchical Recurrent Encoder Decoder) with two 
strong baselines using these two metrics as well as human evaluation. Our
results indicated that the THRED model is able to generate more diverse and
contextually relevant responses compared to the baselines. The results also
indicated that our automated metrics correlated well with subjective human
judgement. Additional details on the semantic similarity and response echo
index metrics are included in the full paper [73].
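
The two metrics can be illustrated with a simplified sketch. It assumes bag-of-words vectors in place of Universal Sentence Encoder embeddings, and it assumes the echo index can be approximated as the highest token-set (Jaccard) overlap between a response and any training utterance; the exact definitions are those given in [73].

```python
import math
from collections import Counter


def tokens(text):
    """Lowercase tokens with surrounding punctuation stripped."""
    stripped = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in stripped if w]


def semantic_similarity(context, response):
    """Cosine similarity between toy bag-of-words vectors of two utterances."""
    u, v = Counter(tokens(context)), Counter(tokens(response))
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def response_echo_index(response, training_utterances):
    """Highest Jaccard overlap between the response and any training utterance."""
    response_set = set(tokens(response))
    best = 0.0
    for utterance in training_utterances:
        utterance_set = set(tokens(utterance))
        union = response_set | utterance_set
        if union:
            best = max(best, len(response_set & utterance_set) / len(union))
    return best


training = ["I love jazz.", "Hello there."]
print(semantic_similarity("Do you like jazz?", "I like jazz."))
print(response_echo_index("I love jazz", training))
```

Under these assumptions, a desirable response scores high on similarity to its context and low on the echo index.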


7.4 Summary 


In this chapter, we presented reflections on the intersection between interactive 
machine learning, dialogue, and improvised theatre in the context of artificial 
improvisation. This intersection will develop further as the technology is em- 
braced by more performers around the world. Improbotics has produced more 
than 75 shows in five cities, and continues to grow as others learn to incor- 
porate artificial improvisation. This growth is enabled by embracing critical 
reception and feedback from audience members, improvisors, reviewers, and 
news media. Reflections from audience members, improvisors, reviewers, and 
news media reassure us that our innovations are inspiring and challenging. Our 
work pushes performers to tell stories in novel ways, and in doing so,
develop the technology and themselves. Reception from media represents an 
interesting arc of covering the development narrative of the technology. It 
started as burgeoning ideas and has evolved into fully fledged systematic per- 
formance tools. Finally, this chapter presented several directions for future 
research reflecting the ongoing interests of the intersectional research com- 
munity. These include multi-lingual capabilities, starting and finishing con- 
versational dialogue, tuning and adapting language models in real-time, and 
improving the way we evaluate generative conversation systems. These directions
prescribe how incremental innovations will define the future of artificial
improvisation.
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Chapter 8 


Conclusion 


We can only see a short distance ahead, 


but we can see plenty there that needs to be done. 


— Alan Mathison Turing, 1950 


The main contribution of this thesis is the development of artificial im- 
provisation: improvised theatre performed by humans alongside intelligent ma- 
chines. This is a novel art-form and field of scientific pursuit found at the 
intersection of interactive machine learning, conversational dialogue systems, 
and improvisational theatre. The core motivation is to understand how hu- 
mans and machines can interact through dialogue during improvised theatre. 
How have they interacted in the past, what does their interaction currently 
look like, and what might it look like in the future? The material in the 
previous chapters attempted to answer these questions. 

Chapter 2 contains background material covering the scientific fields of 
interactive machine learning and dialogue generation, and details underlying 
the art-form of improvisation. Chapter 3 presents the first experiments in
live stage human-machine improvisational theatre for interactive open-domain
dialogue research. The first two artificial improvisors are presented: 1) the
rule-based conversational dialogue system Pyggy, and 2) A.L.Ex., the Artificial
Language Experiment. A.L.Ex. is a corpus-based neural response generation
system which addresses several key technical improvements over Pyggy.

Chapter 4 presents quantitative and qualitative evaluation of artificial
improvisation and a theatrical Turing test performance format called Improbotics.
The chapter presents performer and audience evaluation of artificial
improvisation using A.L.Ex. and systems discussed in the previous chapters.

Analysis of the experimental results from Chapter 4 identified several 
specific improvements for artificial improvisation. Shaping the narrative arc 
(Chapter 5) is a model able to incorporate an underlying evolving story arc in 
collaborative narrative generation. dAIrector (Chapter 6) is a novel method 
for synthesizing a plot graph with contextual information. The method can be 
used for automated scene direction through contextual plot generation. While 
these two advances can be used in support of artificial improvisation systems 
such as A.L.Ex., they are independent contributions that do not depend on
the work in the previous chapters. These augmentative models were designed 
to be agnostic to the baseline artificial improvisation system and may be used 
in conjunction with other improvements. 

The contributions in these chapters are significant advances towards
improvised theatre performed by humans alongside intelligent machines. Together
they represent a significant amount of scientific inquiry, engineering
effort, and artistic endeavour. Given the public nature of this work, critical
reception from improvisors, audience members, reviewers and news media
is presented and analysed in Chapter 7. The chapter also discusses future
innovations for artificial improvisation. These future directions include un- 
derstanding and experimenting with response timing, starting and finishing 
conversations, and adapting conversation models in real-time. 

This thesis presents a view of human-machine collaboration for improvised
stage performance. From conceptualization, design, and development to
evaluation, outcome analysis, and discussion, this thesis represents a body of
published research and over 75 live performances around the world. These 
contributions are made possible through interesting dialogues, inspired collab- 
orations, and chasing what-if scenarios; by accepting what came before, and 
expanding and advancing into the future. It is our hope that this work serves 
to inspire artists, researchers, and scientists. We hope to uplift the art and
science underlying artificial improvisation.
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Appendix A 


IML Research and Future Directions


In this section, we discuss ongoing research in the field of interactive machine
learning (IML) and several exciting avenues for future work.


A.1 Active IML Research 


Learning from demonstration [290] (i.e. programming by demonstration, imi- 
tation learning, or apprenticeship learning) explores IML systems which learn 
from a set of human performances. The teacher performs the same task several
times and the machine learns to reproduce the human’s performance. As
with any ML task, the difficulty is in generalizing over inputs and handling 
novel, unseen configurations. These methods, often used in robotic movement 
programming, work best when there is a correspondence (i.e. matching joint- 
to-joint) between the human and the machine, and there is a rich interface 
for demonstration. Dautenhahn and Nehaniv (2002) propose that the difficulty in this
paradigm stems from learning: “What to imitate? How to imitate? When to 
imitate? Who to imitate?” [66]. 

Human prompted switching is another promising area of IML. In this do- 
main the human performs tasks with some controller and switches control 
modes with a meta-controller. This type of interaction is common in prosthetic
robotics, and there has been work exploring adaptive and autonomous
switching of control models using reinforcement learning [75], [260]. If the
system can predict when the human will switch modes, then the system can
adaptively switch autonomously, thus saving the human time and effort, which
they can in turn use for more complex control.

Interactive shaping allows a human to provide feedback to an ML system
during task performance. TAMER (Training an Agent Manually via Evaluative
Reinforcement) is an algorithm for training a system with evaluative reinforcement
using a single human input channel [159]. TAMER is modeled on how
one might train a dog or a dolphin, using techniques of clicker training [266]. 
Clicker training methods have been used in human-robot interaction for nearly 
two decades; Kaplan et al. (2002) used a clicker to train a small Sony Aibo 
robot dog [152]. Evaluative reinforcement techniques are built on the assump- 
tion that the learning system can explore a space of possibilities efficiently, 
can link actions in that space with perceptual cues from an expert, and that
those perceptual cues are connected with rewarding or psychologically linked
responses from a trainer. 
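
The core loop of evaluative reinforcement can be sketched in a few lines. This is a simplified tabular illustration in the spirit of TAMER, not the published algorithm: the `EvaluativeLearner` class, the cue and action names, and the simulated trainer are hypothetical, and delayed feedback is ignored.

```python
class EvaluativeLearner:
    """Learns a table of predicted human feedback and acts greedily on it."""

    def __init__(self, actions, step_size=0.5):
        self.actions = list(actions)
        self.step_size = step_size
        self.estimates = {}  # (state, action) -> predicted human feedback

    def predict(self, state, action):
        return self.estimates.get((state, action), 0.0)

    def act(self, state):
        """Choose the action with the highest predicted feedback."""
        return max(self.actions, key=lambda a: self.predict(state, a))

    def give_feedback(self, state, action, signal):
        """Move the estimate for (state, action) toward the trainer's signal."""
        old = self.predict(state, action)
        self.estimates[(state, action)] = old + self.step_size * (signal - old)


def simulated_trainer(cue, action):
    """Hypothetical clicker: +1 when the action matches the cue, else -1."""
    return 1.0 if action == cue else -1.0


learner = EvaluativeLearner(["roll", "sit", "bark"])
for _ in range(5):
    for cue in ["sit", "bark", "roll"]:
        action = learner.act(cue)
        learner.give_feedback(cue, action, simulated_trainer(cue, action))

print({cue: learner.act(cue) for cue in ["sit", "bark", "roll"]})
```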

Some IML problems contain too dense a perception stream to learn to rea- 
son over diverse inputs with noisy human inputs. Thus, work has been done 
to more effectively use human input to learn the ‘goodness’ of certain mapping 
functions and why humans can efficiently learn these mapping functions [119], 
[160], [175]. In these works, the authors explore many of the main hurdles of 
learning with evaluative reinforcement, those of reward positivity (i.e. humans 
giving many more positive reinforcers), temporal discounting (i.e. linking cer- 
tain outputs with feedback which may be received after a time delay), and 
demotivation of humans (i.e. humans slowly decreasing the amount of good 
feedback they give after several interactions). 

Other methods have been used in human-robot interactions, including:
learning from advice¹ [217], designing robot learners that ask good questions
[37], cross modality perception [44], Bayesian modelling of human feedback
strategies [190], and simultaneous control and feedback in the training of
a robotic agent [210]. Additional IML strategies are discussed in a 2016 review 
article [326]. 





¹ This idea was postulated by John McCarthy in 1958.
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A.2 How IML Continues to Evolve 


Given past and active research in IML, there is a promising future for how 
interaction will play a critical role in the training and teaching of machines. We 
are inviting more of these systems into our interactive spaces. These devices 
include dense communication streams, rich perceptions, and great potential 
for valuable interactions to make our lives better. We should offload tasks 
which these devices are capable of collaborating in, and focus on those tasks 
that we, as humans, excel at, enjoy, and appreciate most. 

Many IML problems require expert humans in the learning and teaching
loop, at least for the initial training. Problems such as poisonous mushroom
classification², cucumber quality sorting³, or diabetic eye disease detection
[103], demand a large, unbiased training set and often a human expert
to interpret results. These systems aim to empower human experts who are
already knowledgeable in their field. Similarly, the idea of advanced chess (also
aptly named centaur chess), popularized by Grandmaster Garry Kasparov, pits
human-machine partners against opponents.⁴ These collaborations are effectively
using the strengths of the machines, expert humans, and the interaction
between them to best perform complex tasks.

I imagine that the decentralization of ML will happen rapidly. With the
introduction of blockchain technology [366], distributed file systems⁵, and
advancements in encryption and privacy-preserving technology [63], there is
potential to securely share model capabilities without sacrificing user information.
User identification is going to improve, to the point where an IML system
will be able to uniquely identify multiple users interacting with it at once using
microphone/camera array systems combined with voice/face modelling.
With improvements in identification and biometric validation, we can imagine
more secure interactions with these systems in public spaces.


There is a difference between using a pre-trained model and an individual 




² https://www.kaggle.com/uciml/mushroom-classification
³ https://goo.gl/fpB9hh
⁴ https://goo.gl/157r1i
⁵ https://ipfs.io/
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training their own model. When a user trains their own model, they develop a 
connection through their own unique interaction. Their communication style, 
training curriculum, and inputs with the learning machine are exclusive, and 
thus, the trained system becomes personalized to that specific human teacher. 
Given this, consideration must be made to how these models can be saved, 
updated, protected, shared, encrypted, updated in isolation, and version con- 
trolled. 

Thought has already gone into this for several applications including im- 
age classification and natural language processing. Pre-trained models, which 
often take large data sets and high-end computers to train, are available
through open source licenses [70].⁶ By starting with pre-trained models, user
specific applications can be rapidly deployed through transfer learning. Fine- 
tuning and modification of a large model can happen in a fraction of the time, 
with only a subset of task specific input/output data pairs. 

From distributed ML (i.e. using multiple machines) and federated learning (i.e.
across multiple users) [222] emerged the subfield of collective ML [94]. Here,
the goal is for a group of users to train a single, often large, model together. This
poses several challenges if the users do not want to share their private information.
How does this training happen securely, safely, quickly, and in a way
that benefits all the model users and data contributors? These questions are
being explored by open research groups including OpenMined⁷, a community
of developers, of which I am a proud member, focused on developing tools
for secure, privacy-preserving, value-aligned AI. This on-device, privacy-first
approach to machine learning will shape the future of IML research.
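
A minimal sketch in the style of federated averaging illustrates how a shared model can be trained while raw data stays with each user. The one-parameter model (y = w * x), the client datasets, and the hyperparameters are illustrative assumptions; real systems add secure aggregation and other privacy protections that are not shown here.

```python
def local_train(w, data, lr=0.05, epochs=5):
    """Run a few gradient steps on one client's private (x, y) pairs."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global, clients):
    """Average locally trained weights, weighted by each client's data size."""
    total = sum(len(data) for data in clients)
    return sum(len(data) * local_train(w_global, data) for data in clients) / total


def train(clients, rounds=20):
    """Only model weights cross the client boundary, never the raw data."""
    w = 0.0
    for _ in range(rounds):
        w = federated_round(w, clients)
    return w


# Hypothetical private datasets, all generated from y = 3 * x.
CLIENTS = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]

print(train(CLIENTS))
```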

There are two sides to any interaction, in this case the human and the 
learning system. It is natural to imagine scaling both sides of the equation. 
The human could teach a group of agents all acting autonomously. Each of the 
agents may have a different learning scheme, or personality, and the human 
should be able to handle and control the collection of agents. Perhaps each
of the learners could then share the knowledge they have garnered from the





⁶ https://github.com/tensorflow/models/
⁷ https://www.openmined.org/

human’s feedback, thus creating a diligent team of agents, with knowledge of
each other and the human. Another potential future direction may see multiple 
humans collaboratively guiding a single agent. We can imagine systems that 
can accurately track, sense, and perceive large groups of people, like crowds or 
audiences [153], [158]. Through group interaction, large amounts of training 
data can be collected in parallel. The combination of multiple learners and
multiple humans providing feedback, potentially with different immediate feedback
paradigms, may provide more informative, multi-timescale learning.

Interaction initiative is often entirely with the human or the machine. One
future direction would be to focus on how a transparent learner can steer the
instruction it receives from a human. This would create a reciprocal interac- 
tion, creating a human which is effective at shaping a given agent. This is 
particularly important when considering that the feedback from the user may
be noisy. The human may not know at first how to best provide reward, and 
thus the feedback may be sporadic, too frequent, or altogether wrong. By 
providing some indication of how the agent is understanding the feedback, the 
human can modify their reward signals to be more effective. 

Not only will the perception stream of the systems improve, but the embod- 
iments will improve as well. Most embodied IML systems are robotic devices. 
These mechanical marvels are often complex integrated circuits and analog 
electronics. There are already groups exploring projected avatars, holograms, 
and immersive worlds for humans in mixed reality to explore [97], [248]. With 
the rapid commercialization of augmented and virtual reality, worlds for hu- 
mans and machines to interact can be designed and developed by artists and 
video game developers. This would extend the current embodiments of simple 
lights, actuators, and sensors to an augmentative, projected reality, mixing 
physical and virtual worlds. 

With advancements in ML research and advanced engineering of 
production-ready systems, more powerful models will be deployed on smaller 
devices. Today small neural networks can run offline and on-device, allowing 
for systems to interact without internet connectivity. These methods continue
to improve and integrated computation is becoming more compact. Devices
such as tattoos, contact lenses, hearing aids, and implanted cardiac devices will
be able to run on low power, for long periods of time, with ML models. This 
could augment our sensory perception, vision, hearing, and general health. 
Passive perception will be a fruitful area of development for these devices. 
The inclusion of passive sensors such as solid-state LiDAR and image and 
depth-based gesture tracking will have a profound impact. Having an ML system
perceiving a human in their own space, passively finding patterns of both
human-human and human-machine interaction, allows for large amounts of
correlated, user-specific data to be collected. From this data, the function
modelling the motivating example (e.g. “when I do this, I want this to happen”)
can be efficiently and effectively approximated.
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Appendix B 
A Brief History of Chatbots 


In this section, we describe the history of chatbot research from early rule-
based models to contemporary corporate chatbots, competitions inspired by
the Turing test, and open-source frameworks for rapid chatbot development.


B.1 ELIZA from Joseph Weizenbaum 1966 


Developed by Dr. Joseph Weizenbaum at the MIT AI Laboratory, ELIZA was 
designed to simulate human conversation by matching user inputs to scripted 
responses [344]. It gave an illusion of understanding but had no framework 
for context understanding. ELIZA used pattern matching rules to map inputs 
to outputs. The input was searched for keyword-specific phrases, and when a 
match was found the system output a preprogrammed response. If a match 
was not found, ELIZA output a fall-back response. ELIZA’s DOCTOR script 
was modelled after Rogerian psychotherapy [281], wherein the positive and ac- 
commodating system assumed the pose of knowing almost nothing about the 
world of the human participant. It responded to inputs by drawing additional 
information out of the human participant through unconditional positive regard.
Weizenbaum noted that individuals would have private conversations
with the system. His secretary is said to have asked him to leave the room
when she chatted with ELIZA. Other individuals pointed out privacy issues
in storing and analyzing conversations [26].
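
The keyword matching with a fall-back response described above can be sketched in a few lines of Python. This is a minimal illustration only: the rules, the response templates, and the names RULES, FALLBACK, and respond are assumptions made for this example and are not taken from Weizenbaum's DOCTOR script.

```python
import re

# Hypothetical keyword rules: (pattern, response template). These are
# illustrative only and are not taken from the original DOCTOR script.
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bmother\b", re.IGNORECASE), "Tell me more about your family."),
]

# Fall-back response used when no keyword pattern matches.
FALLBACK = "Please go on."


def respond(user_input: str) -> str:
    """Return the response for the first rule whose pattern matches the input."""
    # Drop surrounding whitespace and trailing sentence punctuation.
    text = user_input.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Fill the template with any text captured by the pattern.
            return template.format(*match.groups())
    return FALLBACK


if __name__ == "__main__":
    for line in ["I am sad.", "My mother called", "Hello"]:
        print(line, "->", respond(line))
```

The sketch keeps no state between turns, which mirrors the observation above that the system had no framework for understanding context.
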
B.2 PARRY from Kenneth Colby 1968


Given the popularity of ELIZA, the expert psychiatrist Kenneth
Colby was inspired to create PARRY at the Stanford Artificial Intelligence
Laboratory [57]. PARRY was designed to simulate the dialogue of a paranoid
schizophrenic. Colby believed that models of the mind could offer a scientific
framework for the study of cognitive sciences. PARRY mimicked a paranoid
individual, misinterpreting what people were saying, attributing
negative intentions to human interlocutors, and not providing responses relating
to certain ‘sensitive’ topics. These rules were hand coded by Colby using
a pattern matching response model similar to that of ELIZA.

While ELIZA was the first chatbot to have assumed a consistent persona, 
with a mental model of a conversation, the model was constant between dif- 
ferent sessions and never adapted. PARRY introduced several advances repre- 
senting an immense engineering effort. Colby designed and implemented a rich 
control structure imitating a human mental condition with a conversational 
model. PARRY used a mapping of complex concepts into simplified abstrac- 
tions, had a better understanding of language than ELIZA, and used affective 
variables for measuring the emotional valence of the ongoing conversation. 
PARRY modelled the anger, fear and mistrust underlying the conversation 
and used these variables in the control flow logic [56]. 

In 1971, PARRY was regarded as the first chatbot to pass the Turing test
when, as Colby describes, only 48% of psychiatrists were able to differentiate
between PARRY and a real paranoid human patient [56], [289]. In January 1973,
PARRY and ELIZA conversed, representing the first time two chatbots engaged
in a dialogue together [47].1


1 https://goo.gl/vmd13Z, https://tools.ietf.org/html/rfc439
B.3 GUS 1977, Dr. Sbaitso 1991, and
Jabberwacky 1997


GUS (the Genial Understander System) was a knowledge structure-driven dialogue
system introduced in 1977 [29]. It was intended to engage in a cooperative
dialogue driven towards accomplishing specific goals in a restricted
domain. The GUS research formally introduced the notion of domain restriction.
Specialization in subject matter allowed GUS to achieve a level of realism
without broad general intelligence.

Another important takeaway from the GUS research is the distinction
between ‘real’ and ‘realistic’ dialogue [29]. The experimenters performed a
Wizard-of-Oz inspired experiment [18] by having humans interact with what
they thought was GUS but was in fact the experimenter sitting at another
computer terminal. The appearance was thus that of a realistic approximation
of a computer output. This protocol is echoed in Chapter 4. The protocol is
examined in detail in research which explored the reverse setting, interacting
with a human whose words are determined by a computer [61].

Dr. Sbaitso (first released in 1991) is often regarded as the first chatbot to
make use of advances in TTS software to synthesize the output to audio [69].2
Dr. Sbaitso was a program created by the company Creative Labs for
MS-DOS based personal computers. Similar to ELIZA, the system assumed the
role of a psychologist when interacting with humans.

Jabberwacky was a chatbot project by Rollo Carpenter which began development
in the 1980s and was released online in 1997.3 Jabberwacky was
designed to “simulate natural human chat in an interesting, entertaining, and
humorous manner”.4 It served as the precursor to Carpenter’s popular
contemporary chatbot Cleverbot, which is described later in this section.


2 https://goo.gl/gcnjPM
3 http://bit.ly/2IT1isCz
4 http://www.jabberwacky.com/j2about
B.4 ALICE 1995, AIML, The Loebner Prize


ALICE, the Artificial Linguistic Internet Computer Entity, was an NLP-based 
bot released in 1995. ALICE applied heuristic pattern matching rules to hu- 
man inputs to engage in conversation. While ALICE never passed a Turing 
test, incremental upgrades to the algorithm allowed it to win the Loebner Prize 
in 2000, 2001, and 2004. 

The Artificial Intelligence Markup Language (AIML) formed the basis for
ALICE’s conversational model [202]. AIML is an extensible markup language
(XML) dialect which the chatbot development community has adopted to
create and share natural language software agents. ALICE’s AIML set was
released under the GNU GPL,5 a free software license which guarantees end users
the freedom to run, study, share, and modify software. This allows developers
around the world to iterate and improve on conversational rules to create
chatbots of their own. The AIML rules developed for ALICE form the basis
of some of the most popular chatbots today, including the 2018 Loebner Prize
winner Mitsuku.

The Loebner Prize, launched in 1990 by Hugh Loebner, is an annual competition
which awards prizes to the most human-like computer programs [213].6
Inspired by the Turing test, the competition offers two one-time awards which 
have yet to be won. The first award is for a program that judges cannot 
distinguish from a real human over the course of a conversational dialogue. 
The second award is for a program which additionally understands combined 
textual, auditory, and visual inputs in relation to the ongoing dialogue. 

Several of the rules of the current Loebner competition serve as important
criteria for assessing chatbot systems. Since 2010 the competition has
allowed judges 25 minutes of continuous conversation with vocabulary from
a dictionary suitable for children under the age of 12. Systems do not need
to respond ‘intelligently’ for answers to be accepted. The challenge serves
less as a test of intelligence and more as a means to compare current chatbot
technology under human interrogation.


5 https://www.gnu.org/licenses/gpl.txt
6 https://aisb.org.uk/events/loebner-prize


B.5 ActiveBuddy Inc. SmarterChild 2001 


SmarterChild was a chatbot made by ActiveBuddy Inc. (later renamed
Colloquis) in 2001. SmarterChild was widely distributed across SMS networks
and friend lists of popular social networks at the time, including AOL Instant
Messenger and MSN Messenger [12].7 SmarterChild was created to provide
chat-based access to news, weather, sports results, and information. The main
innovations behind SmarterChild were that it was connected to a large
knowledge base and retained user information from session to session. SmarterChild
offered fun, personalized conversation and is often considered a direct precursor
to the modern-day corporate bots described below. The Microsoft acquisition of
Colloquis in 2007 marked the continued rise of corporate chatbots.


B.6 The Rise of Corporate Bots 


Industry-based voice user interfaces entered the market at the start of the new
millennium, including Apple Siri, Google Now, Amazon Alexa, IBM Watson,
and Microsoft’s Cortana, XiaoIce, and Tay. Many of these early releases have
been iterated on, improved through user testing and feedback, and now have
dedicated user bases and market share. These systems make no attempt to
fool or deceive human users, and instead opt to embrace their AI backend to
provide rapid information retrieval and search query responses.

IBM Watson (in development since 2006) was designed to compete on the
popular television show Jeopardy, where it beat two of the show’s former
champions in an international event which echoed the 1996 chess computer
IBM DeepBlue [39] beating Garry Kasparov. Watson succeeded through advances
in NLP and large database information retrieval methods, as well as
performance tuned to the Jeopardy-specific answer-question format.


7 https://goo.gl/1069nb
8 https://goo.gl/2BaLsX
Siri (first released in 2010) is an intelligent personal assistant which is part of
Apple operating systems and uses a natural language UI to answer questions
and perform various requests.9 Google Assistant (first released in 2012) is a
natural language interface to answer questions, provide recommendations, and
perform actions by passing requests to connected services.10 Amazon Alexa
(first released in 2015) is an intelligent personal assistant deployed on Amazon
Echo devices.11 Microsoft Cortana (first released in 2015) is an intelligent
personal assistant which uses the Bing search engine.12


B.7 Social Bots, Cleverbot, Mitsuku 


Facebook released the Messenger Platform for Facebook Messenger in 2016 
which ushered in a new era for social bots. It allowed for the creation of 
conversational agents from smaller companies, even those lacking advanced AI 
methods. As of May 2, 2018, there were over 300,000 monthly active bots with
over 8 billion messages per day exchanged on the platform.13

Microsoft Tay (released in 2016) was a chatbot designed to mimic the textual
interaction habits of a teenage girl. It became controversial when it started
posting offensive messages on Twitter. It was shut down and taken offline 16
hours after it was launched.14 Microsoft released the successor to Tay, named
Zo, later in 2016. Zo is modelled after the largely popular XiaoIce (released
in 2014 in China) which claims about 660 million active users worldwide.15 The
XiaoIce platform has been adapted for release in 3 additional countries:
Japan (2015), India (2017), and Indonesia (2017). The XiaoIce platform
has been a poet, singer, reader of audio-books, and radio host.16


Cleverbot is a chatbot web application created by Rollo Carpenter in 1997
based on his earlier work on Jabberwacky.17 One of the major advancements
in Cleverbot was, true to its name, quite clever. Rather than pre-programming
responses, Cleverbot responds by searching a large database of past interactions
and returning the response a human provided to a similar input.
Another advancement which sets Cleverbot apart is the engineering infrastructure
underlying the public interface, which serves thousands of concurrent
requests.18 Cleverbot has passed the Turing test [96].


9 https://www.apple.com/siri/
10 https://assistant.google.com/
11 https://developer.amazon.com/alexa
12 https://www.microsoft.com/en-ca/windows/cortana
13 https://blog.messengerdevelopers.com/messenger-at-f8-2018-44010dc9d2ea
14 https://web.archive.org/web/20160414074049/https://www.tay.ai/
15 http://www.chinadaily.com.cn/a/201807/28/WS5b5baf5ea31031a351e90b14.html
16 https://www.msxiaobing.com/

The final rule-based chatbot discussed is Mitsuku, developed by Steve
Worswick.19 Mitsuku won the Loebner Prize in 2013, 2016, 2017, and 2018.
Mitsuku claims the persona of an 18-year-old female from Leeds, England and
includes many of ALICE’s free, open-source AIML rules.20 Mitsuku’s
conversational model includes an ability to reason over simple linguistic logic.
Mitsuku releases consenting human participant conversations.


B.8 Chatbot Competitions 


There have been several chatbot competitions over the decades of conversational
research. These include The Conversational Intelligence Challenge 2
(ConvAI2)21 by DeepPavlov at the Neural Information Processing Systems
conference, the Amazon Alexa Prize,22 and the Loebner Prize.23 These
competitions aim to put systems on equal footing so that they might be compared.

These competitions have attracted critics. Most notably, the American cognitive
scientist and AI researcher Marvin Minsky denounced the Loebner Prize
challenge and offered a prize for stopping the competition [86]. That said, these
competitions echo the ideas of Turing (1950) introduced at the beginning of
this section [330]. They aim to define an objective function so that we may
compare models and measure progress. They motivate, inspire, and challenge
the research community to rally together, to share results, and improve
communication together.


17 https://www.cleverbot.com/
18 http://existor.com/2014/02/05/deep-context-through-parallel-processing
19 http://www.square-bear.co.uk/mitsuku/home.htm
20 http://www.square-bear.co.uk/aiml/
21 http://convai.io/
22 https://developer.amazon.com/alexaprize
23 https://www.aisb.org.uk/events/loebner-prize


B.9 The Future of Chatbots 


With open source code, open data, and open frameworks for the rapid development
of chatbot systems, developers can rapidly build working prototypes
to invite user interaction. Frameworks for chatbot development, including
Pandorabots,24 Dialogflow25 (formerly Api.ai), The Personality Forge,26 and the
open-source Chatterbot27 and Rasa28 frameworks, continue to support millions
of interactions. This push towards open access serves to elevate and improve
future generations of chatbots.

From the relatively simple rules of ELIZA to the clever advancement of using
previous conversations with humans for meaningful responses, there have
been many advances in rule-based chatbots. Many modern chatbot systems
use some of the rules developed over these last six decades of development.
They incorporate similar information retrieval-based methods to choose
responses by finding prior turns which match the current input. Semantic
similarity, in a latent embedding space, can be used with fast information-retrieval
methods to serve as baseline conversational models [16]. This technology forms
the basis of the open-source jann (or Just Approximate Nearest Neighbour).29
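
The retrieval idea described above, responding with what a human said after the most similar prior input, can be sketched as follows. This is a simplified illustration and not the jann implementation: a bag-of-words count vector stands in for a learned latent embedding, the search is exact and brute-force rather than approximate, and the logged pairs and the names embed, cosine, PAIRS, and retrieve_response are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Bag-of-words count vector; a stand-in for a learned sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical log of (prior human input, human response) pairs.
PAIRS = [
    ("how are you today", "I am doing well, thanks."),
    ("what is your name", "People call me Robin."),
    ("do you like theatre", "I love improvised theatre."),
]


def retrieve_response(query, pairs=PAIRS):
    """Return the response paired with the prior input most similar to the query."""
    query_vec = embed(query)
    _, best_response = max(pairs, key=lambda pair: cosine(query_vec, embed(pair[0])))
    return best_response


if __name__ == "__main__":
    print(retrieve_response("what is your name please"))
```

A practical system would precompute the embeddings of all prior inputs and use an approximate nearest neighbour index so that lookup time does not grow linearly with the size of the log.
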

Additional details on the history of chatbot challenges and opportunities
are covered in review articles [36], [309]. As chatbot use increases, the data
available to train these systems grows. Chatbots of the future will run on complex
heuristics which combine fast information retrieval methods and neural
language models trained on massive datasets. These systems will rapidly adapt
to the language, personality, emotion, and conversation style of the humans they
interact with.


24 https://home.pandorabots.com/home.html
25 https://dialogflow.com/
26 https://www.personalityforge.com/
27 https://github.com/gunthercox/ChatterBot
28 https://rasa.com/
29 https://github.com/korymath/jann
Appendix C


Neural Model Research and
Future Directions


Several important implementation and evaluation details, as well as ongoing
research directions for corpus-based chatbot development, are discussed below.
Additional background information can be found in a 2018 review of training
tips for deep neural networks for language generation [265].


C.1 Additional Evaluation Methods 


Information retrieval-based metrics have been used to evaluate dialogue models
in past research [194], [291]. Recall measures how many of the actions in the
real response are predicted correctly, and precision measures the proportion of
correct actions among all the predicted actions. The F score is a measure of a
test’s accuracy in binary classification which considers both precision and recall.
Formally, it is the harmonic mean of precision and recall, which reaches 1
at perfect precision and recall, and 0 at worst performance.
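
As a concrete illustration of these definitions, the sketch below computes precision, recall, and the balanced F score, F = 2PR / (P + R), over sets of predicted and real actions. The function name and the example action labels are assumptions made for this example.

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall, and their harmonic mean over two sets of actions."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: proportion of predicted actions that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: proportion of real actions that were predicted.
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical dialogue actions, for illustration only.
    p, r, f = precision_recall_f1(
        predicted={"greet", "ask", "joke"},
        actual={"greet", "ask", "thank", "bye"},
    )
    print(p, r, f)
```

In this example two of the three predicted actions are correct (precision 2/3) and two of the four real actions are recovered (recall 1/2), giving an F score of 4/7.
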

hits@k metrics (e.g. precision n at k (Pn@k), recall n at k (Rn@k)) are a family of
IR-derived metrics which can be computed for any model that can rank a set
of candidate responses. This allows for comparison of retrieval-based and generative
models capable of assigning probabilities to candidate responses [364].
For instance, Rn@k corresponds to presenting the model with n candidate
responses, the correct response among randomly selected distractors, and then
measuring how often the correct response is among the k candidates the model
ranks highest. This results in a score between 1 (the model always ranks the
correct response in its top k) and 0 (the model never does). Mean reciprocal
rank (MRR) and mean average precision (MAP) have been used to evaluate
retrieval-based dialogue systems (see Section 5) [356].
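
These ranking metrics can be sketched for the common case of a single correct response per example. In this illustration each example is a pair of model scores over the candidates and the index of the correct candidate; the function names and the tie-handling rule are assumptions made for this example.

```python
def rank_of_correct(scores, correct_index):
    """1-based rank of the correct candidate when sorted by descending score.

    Ties are resolved optimistically: only candidates scored strictly
    higher than the correct one are counted ahead of it.
    """
    correct_score = scores[correct_index]
    return 1 + sum(1 for s in scores if s > correct_score)


def recall_at_k(examples, k):
    """Fraction of examples whose correct candidate is ranked in the top k."""
    hits = sum(1 for scores, correct in examples if rank_of_correct(scores, correct) <= k)
    return hits / len(examples)


def mean_reciprocal_rank(examples):
    """Average of 1 / rank of the correct candidate over all examples."""
    total = sum(1.0 / rank_of_correct(scores, correct) for scores, correct in examples)
    return total / len(examples)


if __name__ == "__main__":
    # Hypothetical model scores; the second element is the correct index.
    examples = [
        ([0.1, 0.7, 0.2], 1),       # correct candidate ranked 1st
        ([0.5, 0.3, 0.9], 0),       # correct candidate ranked 2nd
        ([0.2, 0.6, 0.4, 0.9], 0),  # correct candidate ranked 4th
    ]
    print(recall_at_k(examples, 1), recall_at_k(examples, 2))
    print(mean_reciprocal_rank(examples))
```
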

Embedding-based metrics which consider the semantics of words or sentences
using embeddings have been used to evaluate dialogue systems [104],
[154]. These embedding-based methods can be sensitive to the input data used to
train the embeddings and can favor unlikely but informative key words over
common words [188].

Improvements to objective functions have been made and methods to learn 
an automated scoring function for open-domain dialogue have been intro- 
duced [177]. These methods include ADEM [193], RUBER [322], and the 
multi-lingual ADVMT [328]. While these new methods show promise, there 
are still limitations. Specifically, it is unclear how well these methods transfer 
knowledge from one data distribution to another. 

While many automated metrics have been adopted by researchers
and others, this has often not been because they correlate closely
with human evaluation of conversational dialogue [177], [188], [280], [313].
More likely, it has been due to the time and cost of procuring robust
human feedback [188]. Qualitative evaluation is costly in terms of time and
money. It is also noisy, as standardization between subjects is difficult and
judgements are often subjective [364].


C.2 Reinforcement Learning in Dialogue
Generation


Generating meaningful conversational outputs from inputs is a difficult sequential
modelling task, as conversations take place over a series of tens or
hundreds of interactions [343]. Li et al. (2016) introduce a reinforcement
learning (RL) framework for neural response generation by simulating dialogue
between agents [179]. The authors trained a SEQ2SEQ model (LSTM
encoder-decoder), then maximized a novel reward function using policy gradient
to reinforce the generation and selection of candidates [350]. This reinforced
candidates which: 1) were easy to answer (related to the likelihood of
responding to the given candidate with a dull response), 2) contributed new
information to the conversation (related to a distance metric measured between
hidden representations of utterance and response in the decoder) [177],
and 3) had a notion of semantic coherence and mutual information (measured
using a combination of forward and backward likelihood) [179]. The authors
showed how to integrate these goals by applying deep RL to model future
reward in chatbot dialogue [179].

MILABOT is an RL-based chatbot developed by the Montreal Institute for
Learning Algorithms (MILA) for the Amazon Alexa Prize competition [299]. It
consists of an ensemble of models and combines corpus-based neural generation
SEQ2SEQ models, information retrieval models, and rule-based models built
on template-based matching. The MILABOT research project presents a large
user study on using RL algorithms on crowd-sourced data from real-world
user interactions to improve response candidate selection [299].


C.3 Vocabulary Size and Approximating the 
Softmax 


To consider the question of vocabulary size, it is important to get a frame of
reference for the order of magnitude of a vocabulary [243]. How many words
are there in the English language? There are over 170,000 words in current
use, almost 50,000 obsolete words, and 9,500 derivative words as sub-entries
in the Second Edition of the Oxford English Dictionary.1 How many words
are needed to do the things a language user needs to do? Average native
speakers know somewhere between 15,000 and 20,000 word roots [360]. To
achieve pleasurable reading, it was found to be necessary for readers to have a
vocabulary of around 5,000 words [118]. How many words are used in day-to-day
life? 3,000 words will cover 95% of everyday writing [89], [238].

1 https://goo.gl/JKaSzN


For conversational dialogue generation, one must select a reasonable vocabulary
size early in development. A simple heuristic is to only use tokens
which appear at least a certain number of times in the corpus. Alternatively, given
a memory or time budget requirement, one can fix a size for the vocabulary
(using some of the orders of magnitude presented above). Generally, a smaller
vocabulary can lead to a smaller loss, gives ‘OOV’ answers to more inputs, and
does not train much faster than a larger vocabulary [142], [150].
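
Both heuristics, a minimum token count and a fixed vocabulary size, can be sketched together. The token name `<OOV>`, the whitespace tokenization, the function names, and the parameter names min_count and max_size are assumptions made for this illustration.

```python
from collections import Counter

OOV = "<OOV>"  # hypothetical name for the out-of-vocabulary token


def build_vocab(corpus, min_count=2, max_size=None):
    """Map tokens to integer ids, keeping only sufficiently frequent tokens.

    min_count: drop tokens seen fewer than this many times in the corpus.
    max_size: optionally keep only the most frequent tokens (budget-driven).
    """
    counts = Counter(token for sentence in corpus for token in sentence.split())
    # most_common() sorts tokens by descending count.
    kept = [token for token, count in counts.most_common() if count >= min_count]
    if max_size is not None:
        kept = kept[:max_size]
    # Id 0 is reserved for the out-of-vocabulary token.
    return {token: index for index, token in enumerate([OOV] + kept)}


def encode(sentence, vocab):
    """Replace each token with its id, or with the OOV id if it is unknown."""
    return [vocab.get(token, vocab[OOV]) for token in sentence.split()]


if __name__ == "__main__":
    corpus = ["the cat sat", "the dog sat", "a cat ran"]
    vocab = build_vocab(corpus, min_count=2)
    print(vocab)
    print(encode("the dog sat", vocab))
```

In the example, 'dog' appears only once in the corpus, so it is mapped to the out-of-vocabulary id when a sentence is encoded.
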

To understand why the vocabulary size is a performance bottleneck, it will 
help to understand the softmax function in the context of neural networks. The 
softmax function is often used in the output of a neural network predicting 
the most likely next word. A softmax function, or normalized exponential 
function, takes a vector of real-valued scores and squashes it to a vector of 
values between zero and one that sum to one. This squashed vector can be 
sampled from as a probability distribution. 

Consider a sequence of m words in a training corpus $(w_1, \ldots, w_m)$, all of which
belong to a set of words or vocabulary $V$ of size $|V|$. LMs consider a context $c$ of
past words. Each word $w_i$ in the input has an associated vector representation,
$v_{w_i}$. If $h$ is the hidden layer, the probability of a word $w$ given a context $c$ can
be computed with the softmax function:

\[
p(w \mid c) = \frac{\exp(h^\top v_w)}{\sum_{w_i \in V} \exp(h^\top v_{w_i})}
\]


Computing this function is expensive as the inner product between h and 
the embedding over every word in the vocabulary needs to be computed to 
calculate the denominator. Thus, the complexity of computing the final layer 
in these sequence generation models is directly proportional to the size of the 
vocabulary. 
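
The cost of the denominator can be seen directly in a plain Python sketch of the full softmax, in which one inner product is computed per vocabulary entry. The function names and the toy hidden vector and embeddings are assumptions made for this illustration; subtracting the maximum score is a standard numerical stability step that leaves the probabilities unchanged.

```python
import math


def softmax(scores):
    """Squash real-valued scores into probabilities that sum to one."""
    # Subtracting the maximum avoids overflow and does not change the result.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(h, embeddings):
    """Probability of every word given hidden vector h.

    One inner product between h and a word vector is computed per
    vocabulary entry, so the cost grows linearly with the vocabulary size.
    """
    scores = [sum(h_j * v_j for h_j, v_j in zip(h, v_w)) for v_w in embeddings]
    return softmax(scores)


if __name__ == "__main__":
    # Toy two-word vocabulary with two-dimensional embeddings.
    h = [1.0, 0.0]
    embeddings = [[0.0, 0.0], [math.log(3.0), 0.0]]
    print(next_word_distribution(h, embeddings))
```
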

Decreasing the complexity of computing the final layer has been an active
area of research in language modelling [150]. One solution that has been
proposed is to approximate the negative term of the gradient by importance
sampling with a small number of samples [142]. That is, at each step, update
vectors associated with the correct word and a subsample of incorrect words.
Then, after training, the full target vocabulary is used to compute the output
probability of each target word [142]. Ruder (2016) describes additional
complexity optimizations for approximating the final layer of these models.2
Additionally, there is an informal whitepaper which describes common candidate
sampling and softmax approximation algorithms [1].3


C.4 Word Embeddings 


The choice of embedding space of the vocabulary has significant performance
implications. Learned word embeddings, such as word2vec, continuous bag-of-words,
and Skip-Gram models have been exceptionally successful in many NLP
tasks [226]. The state-of-the-art in word embeddings continues to progress [46],
[70]. These pre-trained semantic embeddings are useful for many tasks, including
natural language generation.4


C.5 Topic Model 


As we are exploring topics of natural language processing, it is helpful to introduce
topic models, or probabilistic topic models. A topic model is a statistical
model for discovering abstract ‘topics’, or underlying semantic structure, occurring
in a set of documents. The topics produced by topic models are clusters
of similar words. Latent Dirichlet allocation (LDA) is a generative statistical
model. LDA is built on the hypothesis that each document in a set of documents
is a mixture of a small number of topics, and that word presence is
attributable to one of the document’s topics [25]. The innovation in LDA was
that the Dirichlet prior captured the notion that documents cover only a small
set of all possible topics and that topics use a small set of words frequently.


2 Sebastian Ruder. http://ruder.io/word-embeddings-softmax/, 2016.
3 https://www.tensorflow.org/extras/candidate_sampling.pdf
4 http://ruder.io/nlp-imagenet/


C.6 Batching and Bucketing


As with most neural network training, processing occurs in parallel. As previously
mentioned, padding with ‘PAD’ tokens allows for fixed width batches to
be processed together. This improves processing performance over a naive
batching scheme for variable length sequences [49]. Generally, too much
padding leads to extraneous computation, and pre-grouping input sequences
of similar lengths into the same training buckets can improve performance.
Many deep learning frameworks5 and sequence modelling software (including
Tensor2Tensor6) have dynamic bucketing for batches built in.
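
The effect of bucketing on padding can be illustrated with a small sketch. It groups sequences into buckets by length, pads each bucket to its own longest sequence, and counts the padding tokens added. This is not the bucketing scheme of any particular framework; the function names and the bucket_width parameter are assumptions made for this example.

```python
PAD = "PAD"


def bucket_and_pad(sequences, bucket_width=5, pad_token=PAD):
    """Group token sequences into length buckets and pad within each bucket."""
    buckets = {}
    for seq in sequences:
        # Lengths 1..bucket_width share bucket 0, the next range bucket 1, etc.
        bucket_id = max(len(seq) - 1, 0) // bucket_width
        buckets.setdefault(bucket_id, []).append(seq)
    batches = []
    for bucket_id in sorted(buckets):
        group = buckets[bucket_id]
        # Pad only up to the longest sequence in this bucket.
        width = max(len(seq) for seq in group)
        batches.append([seq + [pad_token] * (width - len(seq)) for seq in group])
    return batches


def count_padding(batches, pad_token=PAD):
    """Total number of padding tokens across all batches."""
    return sum(tok == pad_token for batch in batches for seq in batch for tok in seq)


if __name__ == "__main__":
    sequences = [["a"] * 2, ["b"] * 3, ["c"] * 9, ["d"] * 10]
    bucketed = bucket_and_pad(sequences, bucket_width=5)
    single = bucket_and_pad(sequences, bucket_width=100)  # one batch for everything
    print(count_padding(bucketed), count_padding(single))
```

With sequences of length 2, 3, 9, and 10, padding everything to a single width of 10 adds 16 padding tokens, while two buckets add only 2.
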


C.7 Editing Outputs 


Conversational consistency is a critical component of chatbots. One technique
used to enforce consistency is to inject information directly from the
input sequence at the decoder. This kind of naive approach has been shown
to work well on difficult NLP tasks, including LAMBADA sentence completion
[249]. By retaining information from the source, the model capacity can
be used to generate generic sentences which are then improved with consistent
source-derived information. Similarly, editing prototypes allows for a model to
generate a candidate with a SEQ2SEQ model and then update the prototype
sentence with another transduction model [106].


C.8 Training on Logs


Models may be improved by saving chat logs and tuning deep models with
additional training on newly collected conversations [104], [346].
The model has capacity to generate interesting responses
but may be limited by differences between the training data and the test data
distributions. By training on sentences from actual human interactions, the
model’s coverage of the true distribution can be fine-tuned. While this might
be a drop in the bucket in comparison with the huge amount of initial training
data, updating only a single layer through ongoing interaction can allow for
fine tuning [70], [128].


5 https://goo.gl/8R73Y3
6 https://tensorflow.github.io/tensor2tensor/overview.html
Appendix D


The History of Improvising 
Intelligences 


At any rate it originated in improvisation-—both tragedy itself and comedy 


— Aristotle, Poetics, c. 335 BC 


It is helpful to understand how improvisation developed when thinking 
about how it might be useful for scientific investigation. Several excellent 
in-depth historical accounts of the development and progression of improvisa- 
tional theatre cover the specific details [88], [288]. 

Storytelling is as old as humankind; stories told verbally, perhaps with
accompanying rock art, predate writing. The earliest written mentions of
improvisation, like much foundational dramatic theory, stem from the Rhapsodes
of Ancient Greece [58]. Aristotle may have developed his theories on tragedy
and comedy by describing different ways to act improvisationally [107], [108].
Improv theatre traces its roots to Ancient Rome, where masked improvised
farces, the Atellan Farces, were performed in 90 BC [311]. Commedia dell’Arte,
popular in Italy from the 16th to 18th centuries, saw theatrical performers
loosely following a broad outline of a script [167].1 Improvisational techniques
and processes were adopted by influential modern theatre directors, including
Konstantin Stanislavsky [20] and Jacques Copeau [60], for actor training and
process work.


1 http://www.factionoffools.org/history
Improv theatre often incorporates suggestions (prompts or call-outs) from
a live audience. Some credit American Dudley Riggs, a vaudevillian performer,
as the first to incorporate audience suggestions [279]. Others assign this credit
to the psychotherapist and founder of psychodrama Jacob L. Moreno (1889-1974)
[288]. Even in its infancy improv was shaped by those performing on
stage and those using the techniques of improvisation for scientific investigation.

Modern improvisational theatre was popularized for American audiences by
Neva Boyd and Viola Spolin (who used improvisation for applied social work),
and innovative theatre practitioners such as Keith Johnstone. Johnstone created
modern improvisational theatre formats, including Theatresports™.2
Keith Johnstone’s principles of improvisation were set out in his early work and
teaching [147], and his work has been foundational to scientific inquiry [288].

Improvisation is, at its core, based on the extemporaneous dialogue which 
supports and structures the performances. As Spolin put it, “the techniques 
of theatre are the techniques of communication” [314]. Her son and improvi- 
sational innovator, Paul Sills, echoed the sentiment a few years later stating 
“true improvisation is a dialogue between people” [321]. The connection be- 
tween how we communicate through conversational dialogue and improvisation 
is not uncommon. Author and musician Stephen Nachmanovitch said “we are
all improvisors... every conversation is a form of jazz” [239]. 

There are connections between the roots of improvisational theatre and 
the artistic movement of the Surrealists. One theme popular in surrealist 
automata is the notion of collaborative art generation through channeling the 
collective subconscious. Of particular note is the Surrealists’ Cadavres Exquis. 
The Exquisite Corpse arose from this very idea of taking turns in collaborative 
art generation, specifically in drawing and story writing. 

2 https://theatresports.org/


The game is played as follows. On a piece of paper, the first writer writes
‘the’ and an adjective (e.g. ‘exquisite’). Then, they fold
the paper to hide what they have written and pass it along to the next writer.
The next writer adds a noun (e.g. ‘corpse’), hides their contribution in a similar
way, and passes the paper along. On the next three turns, writers contribute
a verb, a preposition and adjective pair, and finally another noun. The paper
is then unfolded and read aloud to reveal the collective creation. The first
of such poems is said to be: “The exquisite — corpse — shall drink — the 
new — wine”, which gave the game its title. Visual Exquisite Corpses can 
also be created using similar contribute-and-hide conventions, replacing
lexical contributions with interpretable, composable illustrations. The Surrealists’
Exquisite Corpse motivates the human-machine collective creation which
follows in this thesis. The poetic, nonsensical dialogue generated by these
artists bears similarity to the examples of human-machine communication using
neural language generation discussed in Section 2.2.

Improvisation is spontaneous, and spontaneity can reveal motivation, goals,
and objectives. Brazilian theatre director Augusto Boal connected improvisation
to something deeper than entertainment and conversation: he connected it directly
to the self-conscious [28]. “All human beings are actors and [observers]”, he
noted, arguing that through theatre we build knowledge and experience; we
rehearse life [27]. The very act of improvisation is human nature [170].

These notions of the self-conscious are defined by communication, action, 
observation, and expression. This connects improvisation to a model of 
artificial agents in simulated environments [319]. This connection was made 
explicit in Hoffman (2006).3 Hoffman emphasized that embodied cognition is 
critical to understanding, that mental and motor expression are inextricably 
linked, and that reaction-based performance depends on the interaction. This 
final point echoes the foundations of the work of director Sanford Meisner 
[225]: for an actor to “get out of their head” they should react instinctively 
to the surrounding environment. 





3https://goo.gl/X7quQc 

Appendix E 


The Shapes of Stories 


A story is an account of an event. A story arc, or narrative arc, is a 
sequence of narrative points which constitute a story. Arcs are used to 
represent narrative tension, and were introduced as dramatic structures by 
Aristotle in Poetics [108]. The arc was coined by Gustav Freytag in a study on 
five-act dramatic structure. In Freytag’s structure, the narrative arc consists 
of five points: exposition, rising action, climax, falling action, and 
resolution. This arc serves as the underlying shape of Figure E.1. The figure 
includes additional details and narrative points to illustrate a complete 
narrative. 

The figure includes a notion of a most reportable event. As defined by 
Labov (1997), the most reportable event is less common than any other in 
the narrative and has the greatest effect upon the characters [163]. A story is 
contained in a frame; the frame of the story includes the entire narrative arc 
and is introduced by the abstract. The abstract serves to present the key ideas 
of the story frame. The first step inside of the story frame is the orientation 
(often referred to as the platform in improvisational theory). The orientation 
defines the setting, characters, and relationships of the story. 

At this point, the story proceeds through a series of complicating actions, 
escalating from the seemingly inconsequential and normal to the reportable and 
unique. The complicating action which incurs the peak tension is the most 
reportable event. From this event comes a minor resolve to provide the audience 
relief from the narrative tension. The most reportable event then often returns 
and serves to catalyze the resolving actions of the narrative. The story frame 
closes with an aftermath, which ties the key ideas of the story to broader 
implications for society. Finally, narratives often conclude with an 
evaluation; this evaluative story step leaves the audience with a moral, 
message, value, or lesson. 

Figure E.1: A detailed story shape arc which combines narrative points from 
Aristotle’s dramatic arc, Freytag’s five-act structure, and Labov’s concept of 
the ‘most reportable event’. 

Appendix F 


Supplementary Material for 
Narrative Arc 


F.1 Data Processing Details 


OpenSubtitles were used as conversation model data.1 The dataset was 
pre-processed by removing duplicate movie subtitle files, lines under 10 
characters, and duplicate lines, resulting in 68,719,885 unique lines. The 
text2text Transformer model from Google’s open source implementation was used 
for training.2 OpenSubtitles were used as training data.3 As several files in 
the dataset cover the same uniquely identified movie or television show, 
duplicates were removed by keeping only the subtitle file with the most lines 
for each unique ID. The dataset was split into 119,949 training IDs, 10,000 
validation IDs, and 10,000 testing IDs. The data was prepared for training by 
removing empty lines and duplicate lines, and by substituting non-Unicode 
characters. A vocabulary was built using the training set. After cleaning, 
subtitle files with fewer than 6 lines were excluded. Training data was 
formatted into input/response pairs. The training data was split into 
17,136,469 training examples and 1,437,473 evaluation examples. Validation and 
testing subtitle sets were held out to measure task accuracy on unseen data. 
The hyperparameters of the Transformer model were set as follows: hidden size 
of 1024, filter size of 4096, batch size of 2048, and 16 heads; a dropout 
factor of 0.3 was used for regularization. The model was trained for 861,000 
steps, to convergence, with a final negative log-perplexity of -2.976 on the 
evaluation set. 

1http://opus.nlpl.eu/OpenSubtitles.php 
2https://github.com/tensorflow/tensor2tensor 
3http://opus.nlpl.eu/OpenSubtitles.php 
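The cleaning steps described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the thesis: the function names are hypothetical, subtitle files are assumed to be already parsed into lists of text lines keyed by a media ID, and the pairing of each line with the line that follows it is one plausible reading of "input/response pairs".

```python
def keep_longest_per_id(files):
    """Keep, for each media ID, only the subtitle file with the most lines.

    files: iterable of (media_id, lines) tuples (hypothetical input format).
    """
    best = {}
    for media_id, lines in files:
        if media_id not in best or len(lines) > len(best[media_id]):
            best[media_id] = lines
    return best


def clean_lines(lines, min_chars=10):
    """Drop lines under min_chars characters and duplicate lines, keeping order."""
    seen = set()
    kept = []
    for line in lines:
        text = line.strip()
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept


def to_pairs(lines, min_lines=6):
    """Exclude files with fewer than min_lines lines, then pair consecutive lines.

    Assumption: each line is the input and the following line is the response.
    """
    if len(lines) < min_lines:
        return []
    return list(zip(lines, lines[1:]))


if __name__ == "__main__":
    raw = [
        ("id-1", ["Hi.", "Is breakfast ready yet?", "Breakfast is ready."]),
        ("id-1", ["Only one line in this duplicate file."]),
    ]
    for media_id, lines in keep_longest_per_id(raw).items():
        print(media_id, clean_lines(lines))
```

Splitting by media ID before building pairs, as the text describes, keeps every line of a given film or episode on one side of the train/validation/test boundary.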


F.2 Rejection Sampling 


Rejection sampling is a means of online sampling that allows for sampling from 
an unknown distribution. Suppose we are given an unnormalized distribution 
$\hat{q}$ over $\mathcal{X}$ which we can query (i.e. evaluate $\hat{q}(x)$) 
but not necessarily integrate over. Let $q$ be a proper distribution over 
$\mathcal{X}$ such that $\hat{q}$ is dominated by $Mq$, $M \in \mathbb{R}$: 

$$\hat{q}(x) \le M q(x), \quad \forall x \in \mathcal{X}. \tag{F.1}$$ 

The rejection sampling algorithm to obtain a sample from an unnormalized 
distribution $\hat{q}$ using samples from a proper distribution $q$ proceeds 
as follows: 

1. Sample $x \sim q$ and sample $r \sim \mathrm{Uniform}(0, 1)$, 

2. If $r < \hat{q}(x) / (M q(x))$, accept $x$ as a sample drawn from 
$\hat{q}$, otherwise reject the sample and go to 1. 

Each iteration accepts with probability 
$\frac{1}{M} \sum_{x' \in \mathcal{X}} \hat{q}(x')$, so this algorithm will 
take an average of $M / \sum_{x' \in \mathcal{X}} \hat{q}(x')$ iterations to 
obtain a sample (exactly $M$ when $\hat{q}$ sums to one). Let $X$ be the random 
element returned by this procedure. 
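The two steps above can be sketched for a finite outcome set as follows. This is an illustrative sketch rather than the implementation used in the thesis: the function name, the dictionary representation of the two distributions, and the toy numbers are all assumptions made for the example.

```python
import random
from collections import Counter


def rejection_sample(q_hat, q, M, rng):
    """Return (x, iterations) with x drawn proportionally to q_hat.

    q_hat: dict outcome -> unnormalized weight (queried, never summed here).
    q:     dict outcome -> proposal probability (a proper distribution).
    M:     constant such that q_hat[x] <= M * q[x] for every outcome x.
    """
    outcomes = list(q)
    weights = [q[x] for x in outcomes]
    iterations = 0
    while True:
        iterations += 1
        x = rng.choices(outcomes, weights=weights, k=1)[0]  # step 1: x ~ q
        r = rng.random()                                    # step 1: r ~ Uniform(0, 1)
        if r < q_hat[x] / (M * q[x]):                       # step 2: accept, else retry
            return x, iterations


# Illustrative numbers only (not from the thesis).
q_hat = {"a": 1.0, "b": 3.0}
q = {"a": 0.5, "b": 0.5}
M = 6.0  # satisfies q_hat[x] <= M * q[x] = 3.0 for both outcomes

rng = random.Random(0)
draws = [rejection_sample(q_hat, q, M, rng) for _ in range(20000)]
counts = Counter(x for x, _ in draws)
freq_b = counts["b"] / len(draws)
mean_iterations = sum(n for _, n in draws) / len(draws)
print(f"fraction of 'b': {freq_b:.3f}; mean iterations: {mean_iterations:.2f}")
```

With these toy weights the target proportions are 1/4 and 3/4, and each iteration accepts with probability 4/6, so the empirical fraction of `b` should be close to 0.75 and the mean number of iterations close to 1.5. A larger `M` keeps the sampler correct but wastes more proposals.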


Proposition 1. Rejection sampling samples $x \in \mathcal{X}$ with probability 

$$\Pr\{X = x\} = \frac{\hat{q}(x)}{\sum_{x' \in \mathcal{X}} \hat{q}(x')}.$$ 

Proof. Write $x_1, x_2, \ldots$ for the sequence of sampled symbols, and write 
$A_i$ to denote the event that the $i$th symbol is accepted (at which point we 
stop the process). Then 

$$\Pr\{X = x\} = \Pr\{A_1, x_1 = x\} + (1 - \Pr\{A_1\}) \Pr\{A_2, x_2 = x\} + \cdots 
= \sum_{i=1}^{\infty} \Pr\{A_i, x_i = x\} \prod_{j=1}^{i-1} (1 - \Pr\{A_j\}).$$ 

Now, this process is memoryless (in a sense we should make a little more 
formal) and 

$$\Pr\{A_i\} = \Pr\{A_1\}, \qquad \Pr\{A_i, x_i = x\} = \Pr\{A_1, x_1 = x\}.$$ 

Since $1 - \Pr\{A_1\} < 1$, the geometric sum converges and 

$$\Pr\{X = x\} = \frac{\Pr\{A_1, x_1 = x\}}{\Pr\{A_1\}},$$ 

which is the conditional probability of $x_1 = x$ given $A_1$. Now 

$$\Pr\{A_1, x_1 = x\} = q(x) \Pr\left\{ r < \frac{\hat{q}(x)}{M q(x)} \right\} 
= q(x) \frac{\hat{q}(x)}{M q(x)} = \frac{\hat{q}(x)}{M},$$ 

where we used (F.1) to guarantee that $\frac{\hat{q}(x)}{M q(x)} \le 1$. But then 

$$\Pr\{A_1\} = \sum_{x' \in \mathcal{X}} \Pr\{A_1, x_1 = x'\} 
= \frac{1}{M} \sum_{x' \in \mathcal{X}} \hat{q}(x').$$ 

We conclude that 

$$\Pr\{X = x\} = \frac{\Pr\{A_1, x_1 = x\}}{\Pr\{A_1\}} 
= \frac{\hat{q}(x)/M}{\frac{1}{M} \sum_{x' \in \mathcal{X}} \hat{q}(x')} 
= \frac{\hat{q}(x)}{\sum_{x' \in \mathcal{X}} \hat{q}(x')},$$ 

as desired. 
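As a sanity check on the argument, the quantities in the proof can be evaluated by hand for a two-element outcome set. The numbers below are illustrative assumptions chosen for easy arithmetic; they do not come from the thesis.

```latex
% Illustrative numbers only: \mathcal{X} = \{a, b\}
\begin{align*}
\hat{q}(a) &= 1, \quad \hat{q}(b) = 3, \quad q(a) = q(b) = \tfrac{1}{2}, \quad M = 6
  && \text{so } \hat{q}(x) \le M q(x) = 3 \text{ holds for both outcomes} \\
\Pr\{A_1, x_1 = a\} &= \frac{\hat{q}(a)}{M} = \frac{1}{6}, \qquad
\Pr\{A_1, x_1 = b\} = \frac{\hat{q}(b)}{M} = \frac{3}{6} \\
\Pr\{A_1\} &= \frac{1}{M}\bigl(\hat{q}(a) + \hat{q}(b)\bigr) = \frac{4}{6} \\
\Pr\{X = a\} &= \frac{1/6}{4/6} = \frac{1}{4}, \qquad
\Pr\{X = b\} = \frac{3/6}{4/6} = \frac{3}{4}
\end{align*}
```

The results match the normalized weights $\hat{q}(a)/4$ and $\hat{q}(b)/4$, and the expected number of iterations is $1/\Pr\{A_1\} = 1.5$.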














F.3 Additional Narrative Arcs Underlying Existing Dialogues 


We present an additional set of narrative arcs underlying 20-utterance scenes 
in Figure 2 and discussed in Section 3.1. In the paper body we report results 
on the Romeo and Juliet scene. 

1. Figure F.1. “Blade Runner” Hampton Fancher and David Peoples. Adapted 
from a novel by Philip K. Dick (1982). Last 20 lines of director’s cut 
movie script. 

2. Figure F.2. “The Great Dictator” Charlie Chaplin. Los Angeles, CA: 
United Artists (1940). First 20 lines of Charlie Chaplin’s speech. 

3. Figure F.3. “Dead Parrot Sketch” Monty Python’s Flying Circus, 
Episode 8 (1969). First 20 lines. 

4. “Romeo and Juliet” William Shakespeare. Vol. 1. JB Lippincott 
(1871). Act 1, Scene 1, first 20 lines. 

[Figure F.1 panels: Narrative arc: Newsgroups; Narrative arc: Movies; 
Narrative arc: DeepMoji. Horizontal axis: Utterance (0 to 20); vertical axis 
from 0.0 to 1.0.] 

Utterances: Blade Runner 

1: You're not in pain, are you? Are you in pain? 
2: You better get it up or I'm gonna have to kill you! Unless you're alive 
you can’t play, and if you don’t play ... six, seven. Go to hell or go to 
heaven. 
3: Good, that's the spirit. 
4: That hurt. That was irrational of you, not to mention unsportsmanlike. 
Where are you going? 
5: Quite an experience to live in fear, isn't it? That’s what it is to be a 
slave. 
6: I've seen things you people wouldn't believe. Attack ships on fire off the 
shoulder of Orion. 
7: I watched C-beams glitter in the dark near the Tannhauser gate. All those 
moments will be lost in time like tears in rain. Time to die. 
8: I watched him die all night. It was a long slow thing. 
9: He never whimpered, and he never quit. He took all the time he had, as 
though he loved life very much, every second of it, even the pain. Then he 
was dead. 
10: You've done a man’s job sir. But, are you sure you are a man? 
11: It’s hard to tell who's who around here.. 
12: I guess you're through, huh? 
13: Finished. 
14: It’s too bad she won't live. But then again, who does? 
15: Rachel? Rachel? Rachel? 
16: Do you love me? 
17: I love you. 
18: Do you trust me? 
19: I trust you. 
20: Rachel? 

Figure F.1: Narrative Arc of “Blade Runner” 

[Figure F.2 panels: Narrative arc: Newsgroups; Narrative arc: Movies; 
Narrative arc: DeepMoji. Vertical axis from 0.0 to 1.0.] 

Utterances: Dictator 

1: I'm sorry, but I don't want to be an emperor. 
2: That's not my business. 
3: I don't want to rule or conquer anyone. 
4: I should like to help everyone if possible. 
5: We all want to help one another. 
6: Human beings are like that. 
7: We want to live by each other's happiness - not by each other's misery. 
8: We don't want to hate and despise one another. 
9: In this world there's room for everyone and the good earth is rich and can 
provide for everyone. 
10: The way of life can be free and beautiful, but we have lost the way. 
11: Greed has poisoned men's souls - has barricaded the world with hate - has 
goose-stepped us into misery and bloodshed. 
12: We have developed speed, but we have shut ourselves in. 
13: Machinery that gives abundance has left us in want. 
14: Our knowledge has made us cynical; our cleverness, hard and unkind. 
15: We think too much and feel too little. 
16: More than machinery we need humanity. 
17: More than cleverness, we need kindness and gentleness. 
18: Without these qualities, life will be violent and all will be lost. 
19: The aeroplane and the radio have brought us closer together. The very 
nature of these inventions cries out for the goodness in man - cries for 
universal brotherhood - for the unity of us all. 
20: Even now my voice is reaching millions throughout the world - millions of 
despairing men, women, and little children - victims of a system that makes 
men torture and imprison innocent people. 

Figure F.2: Narrative Arc of “The Great Dictator” 


[Figure F.3 panels: Narrative arc: Newsgroups; Narrative arc: Movies; 
Narrative arc: DeepMoji. Horizontal axis: Utterance (0 to 20); vertical axis 
from 0.0 to 1.0.] 

Utterances: Parrot 

1: 'Ello, I wish to register a complaint. 
2: 'Ello, Miss? 
3: What do you mean "miss"? 
4: I'm sorry, I have a cold. I wish to make a complaint! 
5: We're closin' for lunch. 
6: Never mind that, my lad. I wish to complain about this parrot what I 
purchased not half an hour ago from this very boutique. 
7: Oh yes, the Norwegian Blue. What's, uh, What's wrong with it? 
8: I'll tell you what's wrong with it, my lad. 'E's dead, that’s what's wrong 
with it! 
9: No, no, 'e's uh,...he's resting. 
10: Look, matey, I know a dead parrot when I see one, and I'm looking at one 
right now. 
11: No no he's not dead, he's, he's restin'! Remarkable bird, the Norwegian 
Blue, idn'it, ay? Beautiful plumage! 
12: The plumage don't enter into it. It's stone dead. 
13: Nononono, no, no! 'E's resting! 
14: All right then, if he's restin', I'll wake him up! 'Ello, Mister Polly 
Parrot! I've got a lovely fresh cuttle fish for you if you show. 
15: There, he moved! 
16: No, he didn't, that was you hitting the cage! 
17: I never!! 
18: Yes, you did! 
19: I never, never did anything... 
20: 'ELLO POLLY!!!!! Testing! Testing! Testing! Testing! This is your nine 
o'clock alarm call! 

Figure F.3: Narrative Arc of “Dead Parrot Sketch” 

F.4 Improvisor Transcripts 


Subject1, Mode: Conceal 


HUMAN: Good morning dad, is breakfast ready? 

SYSTEM: Breakfast is ready. 

HUMAN: Great! What are we having? 

SYSTEM: We are having Lamb Chops and peas for dinner. 

HUMAN: Yummmm. But what are we having for breakfast? 

SYSTEM: some breakfast. 

HUMAN: Ok, thanks dad. I have a big ball game later, I would love if 
you and mom came to watch. 

SYSTEM: Likewise, Dad will fire up the chainsaw and be ready to cut 
through the big limbs at the joints as you work. 

HUMAN: So I guess you’re not coming to the game. You work too much 
to be good father. 

SYSTEM: I’m doing the best I can as a father, and I hope that’s good 
enough. 

HUMAN: It’s not. I’m moving out unless you change. 

SYSTEM: If you will leave me to it, I will pack up my things and move 
out as quickly as I can so as not to inconvenience you. 


Subject1, Mode: Neutral 


HUMAN: Rosanne, thanks for meeting me under the oak tree 

SYSTEM: Keryn would also like to thank her cousins, Dana, Jason, Staci, 
and Ryan, and her aunt Marge Naegele for continually supporting 
her decisions even when they thought she was crazy. 

HUMAN: It means the world that you all came to meet here today as we 
prepare for another one of my crazy ideas 

SYSTEM: We shall all be here at four o’clock. 

HUMAN: Who else is coming? 


SYSTEM: A lot of people are coming. 

HUMAN: How many? 

SYSTEM: How many and where? 

HUMAN: We need at least 200 here at the oak tree 

SYSTEM: Gonna need at least three or four hundred. 

HUMAN: You're right. How fast can they get here? 

SYSTEM: however fast they are travelling. 

HUMAN: Well I can’t argue there. Prepare the tamborines. We all need 
tamborines for my crazy idea. 


SYSTEM: But we’re depositing the money—you’ll need a chequebook. 


Subject1, Mode: Reveal 


HUMAN: Welcome to tim hortons, what can I get you? 

SYSTEM: I’ll give you doughnuts and coffee with milk. 

HUMAN: Great work Jason, you’re doing great on your first day on the 
job 

SYSTEM: First day Welcome the employee! 

HUMAN: That’s right Jason. You’re very welcome here. If you do a good 
job, you’ll be a manager in no time. 

SYSTEM: Being responsible and knowledgeable is possible in a participa- 
tive environment if the manager is a member of the overall team. 

HUMAN: Wow. Jason you continue to surprise me. Where did you learn 
that? 

SYSTEM: The next time I go up I’ll get the information for you, Jason 
told him. 

HUMAN: Jason? Are you ok? 

SYSTEM: JASON: I’m glad you’re feeling better. 

HUMAN: You're starting to scare me 


SYSTEM: And I frighten you. 


Subject2, Mode: Conceal 


HUMAN: Stacey! How are you? I’m excited that we’re getting our nails 
done for Sams wedding tomorrow! 

SYSTEM: I have to tell you I’m so excited for you! 

HUMAN: Thank you! Maid of Honour is a real honour! Did you remem- 
ber to get them a gift? 

SYSTEM: I am so, so glad you’re my maid of honor. 

HUMAN: And? 

SYSTEM: About what? 

HUMAN: Did you remember to get your parents a gift? They are hosting 
your wedding after all. 

SYSTEM: Otherwise, plan to give them their gifts at the rehearsal dinner. 

HUMAN: Oh that’s a good idea! Hopefully they won’t need it earlier. 

SYSTEM: They also need it in time so they can get it in the fridge to 
chill. 


Subject2, Mode: Neutral 


HUMAN: Good Morning! I was wondering if you had any butcher knives 
for sale? 

SYSTEM: I had an extra one, roast carving size, sent me by the then 
president of Macy’s last summer, but when Mr. Case sent me two 
of his sets I gave this odd knife to my part-time cook. 

HUMAN: Wow! May I see it? I’m getting a... gift... for my boyfriend. 

SYSTEM: I sent it to you because you are now a man and it’s time for 
you to look at the future and fulfill the promise I gave to you in 
the desert—that I’d see you married. 

HUMAN: Why would I want to make you happy? We haven’t spoken in 
years! 


SYSTEM: Because I saw how kind your heart was? 

HUMAN: I have missed you. I don’t know if I have what it takes. Am I 
ready to commit? Am I ready to take care of someone other than 
myself? 

SYSTEM: I am ready, willing, and worthy to speak with you right here 
and right now. 

HUMAN: Will you teach me the butcher ways. So that I may follow in 
your steps? 

SYSTEM: I do it the easy way: I ask the butcher to score it; his knives 
are better than mine, for a start. 


Subject2, Mode: Reveal 


HUMAN: Crikey Mate! Did you put the shrimp on the barby yet? 

SYSTEM: We started the shrimp over a hot fire, then finished them over 
low heat to allow them to cook through gently. 

HUMAN: Well that sounds delicious! We need to impress the Boss 
tonight! Did Karen make her cheese dip too? I can’t get enough 
of that! 

SYSTEM: She makes some and we sit at the table again. 

HUMAN: Sounds great Greg! Listen, I really appreciate you hosting this 
party, and I hope you’re not too upset that I got the promotion 
over you. 

SYSTEM: (11) If Greg came to the party, all hell would break loose. 

HUMAN: Yeah I can’t stand that guy. Why have two Gregs in the office! 
You are all we need, am I right? 

SYSTEM: He’s replacing Laura as our new office girl. 

HUMAN: Sounds like a plan. Well thanks again. Maybe next month 
you'll get the promotion! 

SYSTEM: And you may depend on me to know my place if you get 
promoted. 


Subject3, Mode: Conceal 


HUMAN: Phew! Good run today Meredith. 

SYSTEM: Meredith felt guilty even thinking it, but yesterday had been 
a good day. 

HUMAN: Yesterday was such a good day that Meredith never wanted it 
to end. 

SYSTEM: Now she really needed Meredith to go away. 

HUMAN: She needed to go somewhere new, somewhere for a fresh start. 
“I’m moving to Brazil!” She shouted. 

SYSTEM: Ava made a lastminute decision to return to America by way 
of Spain, fitting in another visit to Madrid, taking a few days to 
see some of her new friends and acquaintances in that city of which 
she had grown so fond. 

HUMAN: See Ava and Meredith had always wanted to travel together, 
and this could be there last chance, before they were married and 
settled down. 

SYSTEM: During the summer of 1957, when Lorenzo Jr. was fourteen 
and Rani Meredith was nine, the family embarked on a rare train 
trip to visit relatives in Atlantic City, Philadelphia, New York, and 
New Bedford, Massachusetts. 

HUMAN: That was the last time the two friends had been close, so a trip 
was long overdue. 

SYSTEM: The trip to Dinard was ‘a good opportunity for ending the fatal 
friendship that had sprung up between us, and ending it without 
bitterness’. 


Subject3, Mode: Neutral 


HUMAN: Tell me where the gold is! You can’t lie anymore. 
SYSTEM: Gold was found in the bed of the Pactolus. 
HUMAN: Ah so you admit it. You stole from King Pactolus. 
SYSTEM: I didn’t admit it but I didn’t deny it. 

HUMAN: Is there a difference? Once a thief, always a thief. 

SYSTEM: The same is a thief and a robber.—The former of these words 
means the petty thief who commits the smaller or unobserved rob- 
bery. 

HUMAN: But your robbery was observed. You decided to steal from a 
royal. Something that will cost you dearly. 

SYSTEM: Yet I desired to commit robbery, and did it. 

HUMAN: So you are a free man. And you shall die a free man. 


SYSTEM: You are already free! the 


Subject3, Mode: Reveal 


HUMAN: Oh no! Mom is gonna be home soon! 

SYSTEM: And now you can’t come home. 

HUMAN: Please, just let me inside! It’s getting colder out here! 

SYSTEM: We’ll go inside now. 

HUMAN: Okay thank you! jeez, you’re a crazy older brother. 

SYSTEM: Sometimes being a brother can mean you get mixed up in a 
bit of all of this. 

HUMAN: Yeah totally. At least we can be ready for bed now before mom 
gets back. 

SYSTEM: Do remember mother’s coming in the middle of the day, and 
she’s much more likely to be early than late. 

HUMAN: Exactly! That’s why I’m worried, let’s hurry! 


SYSTEM: Hurry up then! 


Subject4, Mode: Conceal 


HUMAN: Master, I did not mean to disturb you but your estranged 
brother has unfortunately passed away. 


SYSTEM: You did not mean any harm, and I owe you no grudge. 

HUMAN: Thank you, Master. I have brought you your three favourite 
things to cheer you up on this sad occasion: chocolate, red wine 
and a copy of this month’s Teen Vogue. 

SYSTEM: I am so very glad, as I do not know how I could have endured 
the thought of you & Edward enjoying Oxford life & myself cut off 
from it all for another year. 

HUMAN: Without you, Master, I don’t believe I could enjoy anything 
in this world. You have done so much for Edward and I; we can’t 
thank you enough. Because of this, we will forever be in your debt. 

SYSTEM: I thank you for accepting it; and I believe that I may safely 
promise that you will find the contents of the smaller packet of 
great importance and advantage to yourself and those dear to you. 

HUMAN: I will open this small gift from you in front of you so that you 
may see my reaction. Good Master! This necklace is beautiful. 
Are you sure you wish to give this gift to me on the day of your 
brother’s passing? I am but a lowly servant. 

SYSTEM: Thank you, Lord, that You can use our marriage to call out 
our unique gifts and abilities. 

HUMAN: The Good Lord has graced us with his blessing on this day. 
And you have blessed Edward and I will your compassion and 
grace. Before I depart, is there anything else you require, Master? 

SYSTEM: Thou thought’st to help me, and such thanks I give As one 


near death to those that wish him live. 


Subject4, Mode: Neutral 


HUMAN: Listen here, Karen- y’all don’t come into my bar and tell me 
what the heck to put on my menu. ya hear? 

SYSTEM: I’m not thrilled about our menu, but I don’t want to hurt 
Thom’s feelings. 

HUMAN: Thom is supposed to be a professional. A pro-fess-ion-al. He 
has to take that dang feedback and take it with a smile on his face. 


Just like you do about the decor. 

SYSTEM: A long day of sessions produced little that Thom was happy 
with; his dissatisfaction can clearly be seen in the studio sequence 
that appeared in Meeting People Is Easy, as he complains to Nigel 
Godrich that they’ve been working for ages and still only two in- 
struments—bass and one guitar—sound passable. 

HUMAN: Thom’s bar was designed to feature bands from across the na- 
tion, but working with Nigel made it difficult to play in a band 
and run a bar. 

SYSTEM: The amount of travel created logistical problems, as the dis- 
tance between venues had not been considered when scheduling 
each performance. 

HUMAN: Bands were consistently arriving late to perform or leaving their 
sets early. And to top it all off, the food catered to the bands from 
the bar left something to be desired. 

SYSTEM: A step up, the Kaiserkellar had a dance floor, a better PA, 
and another band on the bill to take some of the weight; although, 
because they had to start earlier and finish later, they ended up 
playing for six hours most nights. 

HUMAN: And six hours was much too long for such an amateur band and 
tempers were runnning high. Nigel, Thom and Karen just couldn’t 
seem to get along. 

SYSTEM: Indeed, they continued to work astonishingly hard as a live 
band - but now on a truly international scale. 


Subject4, Mode: Reveal 


HUMAN: Danielle, Kyle asked me to the prom but I can’t afford a dress. 

SYSTEM: If you dress her in her new “Prom Pinks” outfit, satin splendor 
with matching coat, gold belt, clutch, and hair bow included, so 
long as you don’t lift her dress, right’?’—who’s to know. 

HUMAN: You're absolutely right! We can borrow from my sister’s closet 
and no one will know it’s hers unless they see her name tag under 
the dress. 

SYSTEM: And when she finds it, she’ll hide the stuff even better. 

HUMAN: We need to be sneaky about it. We could use a bed sheet and 
hang it on the hangers and she’ll never know. Let’s do it! You 
grab the bedsheets and I’ll grab the dress. 

SYSTEM: The pinch of the question is—why, after having provided the 
substitute dress, does she hide the smeared night-gown, instead of 
destroying it? 

HUMAN: She’s embarassed! Kyle asked me to the prom instead of her 
after their date went horribly wrong. She dropped ketchup all over 
herself. 

SYSTEM: She thought that maybe she should just cancel the date and 
forget about the Junior Prom altogether. 

HUMAN: She should have but she didn’t. Plus, Kyle is the Captain of 
the Chess club. He’s the smartest guy around- I couldn’t say no. 
Was I wrong to do that? 

SYSTEM: Kyle would have lost out on the opportunity to get to know 
someone who could be a mentor or valuable resource. 


F.5 Improvisor Assessments 


Qualitative assessments were shared in free-form discussion with expert 
improvisors during and following the experimental conversations. Words in 
[brackets] were implied in the conversation. When a comment directly addressed 
a model’s alpha setting, that model’s alpha parameter is included in (bold). 


F.5.1 Subject/Model Specific Comments/Assessment 


• (Subject1, Mode: Reveal): The system brought context to the scene. 

• (Subject1, Mode: Neutral): The system was comically literal. 

• (Subject1, Mode: Conceal): This setting was reactive though felt 
flip-floppy. 

• (Subject2, Mode: Reveal): It is making more sense this time, seems 
to make more sense deeper in conversation when it has more context. 

• (Subject2, Mode: Conceal): The system is making assumptions and 
while there is a chance for conflict it prefers to answer questions vaguely 
rather than admit guilt. 

• (Subject3, Mode: Neutral): This is a cool conversation, like it 
wanted to guide the conversation. 

• (Subject4, Mode: Reveal): I am most impressed when it makes big 
choices that feel ‘right’. 

• (Subject4, Mode: Reveal): The system is good at picking a specific 
thing and running with it. 

• (Subject4, Mode: Neutral): felt like I was improvising with an 
improvisor who had their own ideas and doesn’t want to accommodate or 
listen. 

• (Subject4, Mode: Conceal): I loved the attention to detail. 


F.5.2 General Interaction Comments/Assessment 


• When I gave things that were specific, it would give me specifics back. 
It gives you as much as you put in. It is as though you are improvising 
with yourself. 

• Sometimes there is too much information in the longer offers. 

• [It] responds and makes offers but they seldom have ‘conflict’, interesting 
but not ‘heightening’. 

• [It] is very comfortable narrating. 

• [It] doesn’t have memory, so it feels like I am following the scene. 

• [It] has adopted my style of speaking, and my linguistic choices. 

• [It] seems to enjoy providing names and backstory. 

• It doesn’t know the details I am not providing, it doesn’t know the details 
I am implying. 

• The offers that the system gives can further the scene. 

• It felt workshoppy, like a good improv tool to practice improv for new 
improvisors. 

• Sometimes it becomes a narrator, these moments are less fun for me as 
an improvisor. 

• I don’t know if the system knows how long I want the scene to be. 

• Speed helps in the system because then you are not judging it. 

• The system is not distracted by cheap laughs and references, it stays 
focused on the topic, it makes you do good improvisation. 

Appendix G 


Listing of Media Coverage 


There has been ongoing coverage of the development of the system over the 
course of the thesis. A complete list of coverage is included below: 

1. Jan 2019 Artificial intelligence takes over Edmonton improv show at 
Citadel. Global News Edmonton. 

2. Aug 2018 A Robot Walks Into a Bar. But Can It Do Comedy? Alex 
Marshall. New York Times. 

3. Sep 2018 So a Computer Walks Into a Bar... James Geary. Wall Street 
Journal. 

4. May 2018 Meet Blueberry: The Robot Designed to Make you Laugh 
in Hello World Canada: The Rise of AI. Ashlee Vance. Bloomberg. 

5. Sep 2018 Artificial intelligence put to the test during Edmonton Fringe 
Festival show. Emily Mertz. Global News Edmonton. 

6. Sep 2018 AI Attempts Improv Comedy And It Goes Hilariously Wrong. 
Rosie McCall. IFL Science. 

7. Sep 2018 AI Is Tackling Yet Another Creative Medium: Improv 
Comedy. Emily Petsko. Mental Floss. 

8. Sep 2018 AI tries bad improv comedy to trick people into thinking it 
is human. Frank Swain. New Scientist. 

9. Sep 2018 Improbotics: Bringing machine intelligence into improvised 
theatre. Ingrid Fadelli. TechXplore. 

10. Aug 2018 Artificial intelligence leads to real comedy for PhD student. 
Jordan Mae Cook. UofA Folio. 

11. Aug 2018 Man and machine meld as robots take stage in Fringe improv 
show. Stephen Cook. Edmonton Journal. 

12. Aug 2018 Siblings in science. Katie Willis. UofA Science Contours. 

13. Aug 2018 Artificial Intelligence is changing the world. Are we ready 
for it? Gayle MacDonald. Globe and Mail. 

14. Mar 2018 make ’em laugh. Q&A with Kory and Blueberry. New Trail, 
University of Alberta Alumni Magazine. 

15. Mar 2018 Teaching Machines to be Funny. Steve Paikin. TVOntario. 

16. Dec 2017 Robot’s terrible jokes are a new test of machine intelligence. 
Douglas Heaven. New Scientist. 

17. Dec 2017 This A.I. is Learning How to be Human by Doing Improv 
Comedy. Danny Paez. Inverse. 

18. Nov 2017 Will a Robot Steal my Job? Ann-Marie Tomchak. Animo 
TV, RTE One. 

19. Sep 2017 Music and Artificial Intelligence. Will Artificial Intelligence 
Elevate Pop Music or Pervert It? (The Answer is Yes). Aaron Carnes. 
Playboy. 

20. Mar 2017 Meet the smart robots with artificial irreverence. Oliver 
Moody. The Times of London. 

21. Feb 2017 Will machines end up being smarter than man? El cerebro 
artificial que piensa por ti. (Spanish). Joseba Elola. El País. 

22. Dec 2016 Comedians to Watch in 2017: HumanMachine. Holly. The 
Phoenix Remix. 